Mononofu / furidamu-comments

Comments for furidamu.org

blog/2020/12/22/muzero-intuition/ #7

Open utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

MuZero Intuition

Posts and writings by Julian Schrittwieser

http://www.furidamu.org/blog/2020/12/22/muzero-intuition/

mosicr commented 3 years ago

DeepMind blog link is dangling.

Mononofu commented 3 years ago

Thanks, should be fixed once the CDN has propagated!

mosicr commented 3 years ago

Not a problem, thank you for your paper!! And... I had a quick look at your life section :-) and your advice on index funds. Here is what the famous Sam Zell says (my painful experience too, btw; it might save you a dime or two if you reconsider):

"Supposedly, if your index fund matches the market, you've succeeded. But if that market is going down like an elevator, I am not sure that's much success." https://markets.businessinsider.com/currencies/news/billionaire-investor-sam-zell-questions-tesla-bitcoin-work-from-home-2020-12-1029910959

Onwards and upwards!!

ESRogs commented 3 years ago

Each time an action is selected, we increment its associated visit count n(s,a), for use in the UCB scaling factor c and for later action selection.

What is UCB?

KozukiOden commented 3 years ago

@Julien,

MuZero's model is incredibly similar to a generative Markov-blanket-based model we have been developing to describe a completely different use case: emotion cognition in animals.

Can we set up 30 minutes to chat? I believe there are immediate optimizations available to MuZero based on this.

KozukiOden commented 3 years ago

Mostly, we need help, and any help would be much appreciated.

julien commented 3 years ago

@KozukiOden no idea about what you said

pnorridge commented 3 years ago

A nice side effect of having an internal model: I found that (for more deterministic environments than Go), you can get the algorithm to predict a few moves in advance pretty successfully.
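For readers curious what that looks like in practice: a minimal sketch of multi-step prediction, assuming the standard MuZero decomposition into representation/dynamics/prediction functions. The three network functions below are toy stand-ins (names follow the paper, bodies are placeholders, `predict_moves` is a name I made up), not the real trained networks:

```python
def representation(observation):
    # h: encode the real observation into a hidden state (toy: identity).
    return list(observation)

def dynamics(state, action):
    # g: advance the hidden state by one imagined action (toy: append it).
    return state + [action]

def prediction(state):
    # f: return a policy over actions; the value head is omitted here
    # (toy: a fixed two-action policy).
    return {0: 0.7, 1: 0.3}

def predict_moves(observation, num_steps=3):
    """Greedily roll the learned model forward to guess the next moves."""
    state = representation(observation)
    moves = []
    for _ in range(num_steps):
        policy = prediction(state)
        action = max(policy, key=policy.get)  # most likely action
        moves.append(action)
        state = dynamics(state, action)       # imagine taking it
    return moves

print(predict_moves([]))  # → [0, 0, 0] with the toy policy above
```

In a deterministic environment the imagined trajectory can track the real one closely, which is presumably why this works better there than in stochastic settings.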

GoingMyWay commented 3 years ago

@ESRogs

It is Upper Confidence Bound.
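Specifically, MuZero uses the pUCT variant of the Upper Confidence Bound rule to pick actions during search. A minimal sketch of the per-action score (constants `c1`, `c2` take the values given in the MuZero paper; the function name is mine):

```python
import math

def ucb_score(parent_visits, child_visits, prior, value,
              c1=1.25, c2=19652.0):
    """pUCT score for one action in AlphaZero/MuZero-style search.

    parent_visits: N(s), total visit count of the parent node
    child_visits:  n(s, a), visit count of this action
    prior:         p(s, a), policy-network prior for the action
    value:         Q(s, a), mean value estimate for the action
    """
    # Scaling factor c grows slowly with the parent's visit count.
    c = c1 + math.log((parent_visits + c2 + 1) / c2)
    # Exploration bonus: high for high-prior, rarely visited actions.
    exploration = c * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return value + exploration
```

Incrementing n(s, a) after each selection shrinks that action's exploration bonus, so the search gradually shifts from exploring high-prior actions to exploiting high-value ones.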

KozukiOden commented 3 years ago

Here's a question - do these two things look the same?

  1. Selecting the top action that optimally opposes entropy given energy constraints.

  2. MuZero+n selecting the top action based on the sum of average rewards across all predictions?

I'd go on to say, then, that attenuating energy between branching prediction states is important.

But if we go further, there's a very neat trick we can do: the prediction tree we traverse is just an embedding of our action space, and at that point it becomes a path-finding problem for maximizing your prediction.

We have a specific education application we're building a similar architecture for. We should chat; I think it'll be interesting for both of us.

On Fri., Dec. 25, 2020, 1:14 a.m., Julien Castelain <notifications@github.com> wrote:

@KozukiOden https://github.com/KozukiOden no idea about what you said


dbsxdbsx commented 3 years ago

@julien, after reading the paper and your blog: is it correct to say that, in environments without a perfect simulator, MuZero's only benefit over model-free algorithms is that it can use planning when acting?
In detail, what I mean is that the learned model seems to give the agent planning ability as an advantage in such environments, but doesn't really offer higher sample efficiency compared with classic model-free algorithms. The Reanalyze part of MuZero looks to me like the same technique as offline learning from a replay buffer in model-free algorithms, so I don't think Reanalyze is an advantage that comes from the model.

If I am wrong, please tell me. Thanks.