LeelaChessZero / lc0

The rewritten engine, originally for tensorflow. Now all other backends have been ported here.
GNU General Public License v3.0

Adding Improvements from EfficientZero #1910

Closed · shermansiu closed this issue 1 year ago

shermansiu commented 1 year ago

Even though lc0 is based on the AlphaZero architecture, would it be possible to implement some of the improvements introduced in the EfficientZero paper (which itself builds on MuZero, a variant that replaces hard-coded game dynamics with a learned latent dynamics function)?

The three improvements introduced by the paper (a self-supervised consistency loss, end-to-end value prefix prediction, and an off-policy correction to the value target) could be added to AlphaZero-style training to improve its sample efficiency.

The paper has been around for a while, so I expected these ideas to have been added to lc0 already, but at first glance it seems this has not been explored yet.

shermansiu commented 1 year ago

I checked Discord and the last time it was discussed (not in #off-topic) was shortly after its publication in 2021.

TL;DR: latent self-consistency only applies if you have a learned state representation; limited data is not a problem here; and nobody has seriously considered using these techniques to improve Leela.

I disagree with this sentiment: it's still worth trying in an experiment, IMO.


Tinker — 12/01/2021 1:50 PM Source code for the Efficient Zero paper/project has been released (link to paper on the Github page) GitHub - YeWR/EfficientZero: Open-source codebase for EfficientZero...

23 ...Qg3 — 12/01/2021 1:51 PM Huh, so they did post it in a month, as they said they would

Tinker — 12/01/2021 1:52 PM Yep. I signed up to be on the mailing list to be notified.

KarelPeeters — 12/01/2021 2:46 PM Is there something useful for LC0 in there? It seems to mostly be about improving the trained model and fixing issues arising from an inaccurate model, and those things are not an issue for LC0.

ghostway — 12/01/2021 11:24 PM Are they? Why would we want a 40b net if our 30b is perfect 😉

masterkni6 — 12/01/2021 11:30 PM limited data is not something we have 😛

Tilps — 12/01/2021 11:59 PM I'm waiting for someone to read that paper, have an 'aha' moment, and propose a formula for transforming our recorded policy target into a 'better' policy target. But I don't think the paper as it stands is directly applicable.

ghostway — 12/02/2021 12:16 AM Hmm, it seems like it can help us at least at the start of runs. So they propose doing some self-supervised thing called "SimSiam"

[Image: latent self-consistency loss model architecture]
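(For reference, the SimSiam-style consistency loss from the paper looks roughly like the sketch below. This is PyTorch-like pseudocode; the `dynamics`, `encoder`, `projector`, and `predictor` modules are placeholders for the paper's components, not lc0 code.)

```python
import torch
import torch.nn.functional as F

def consistency_loss(dynamics, encoder, projector, predictor, s_t, a_t, o_next):
    """SimSiam-style self-supervised consistency loss, as in EfficientZero.

    dynamics:  g(s_t, a_t) -> predicted next latent state (placeholder module)
    encoder:   h(o_next)   -> latent state from the real next observation
    projector/predictor: small MLP heads, as in SimSiam
    """
    # Predicted branch: roll the latent state forward through the learned dynamics.
    s_pred = dynamics(s_t, a_t)
    p = predictor(projector(s_pred))

    # Target branch: encode the true next observation; stop-gradient, as in SimSiam.
    with torch.no_grad():
        z = projector(encoder(o_next))

    # Negative cosine similarity pulls the predicted latent toward the observed one.
    return -F.cosine_similarity(p, z, dim=-1).mean()
```

The point of the loss is to give the learned dynamics model extra supervision; it only makes sense when the next state has to be predicted rather than computed exactly.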

ghostway — 12/02/2021 12:24 AM That's the first thing tho, they propose three 😅 The second thing, if I understand correctly, is to predict the W. Hmm

[Image: EfficientZero's new value target equation]

ghostway — 12/02/2021 12:31 AM And they propose this as the value target. Have to figure out what that u is. Ok, that u is "the reward from the replay buffer"
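(The screenshotted equation is not reproduced here; for reference, the value target in the paper has roughly this form, with notation approximated:)

```latex
z_t \;=\; \sum_{i=0}^{l-1} \gamma^{i}\, u_{t+i} \;+\; \gamma^{l}\, v_{t+l}
```

where the u_{t+i} are the rewards stored in the replay buffer, v_{t+l} is the value predicted at the bootstrap position, and l <= k is an unroll horizon that the off-policy correction shrinks for older, more off-policy data.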

ghostway — 12/02/2021 2:23 AM thoughts?

KarelPeeters — 12/02/2021 6:56 AM See, but the SimSiam thing is about improving the observation -> state prediction, while LC0 just gets the fully correct state as an input. I should have been clearer: earlier, with "model" I meant the state prediction from observations and actions.

ghostway — 12/02/2021 6:58 AM I'm just stating what's in the paper, haven't given it any serious brain power yet

mooskagh commented 1 year ago

Hi, thanks for filing the issue!

Just FYI, filing a GitHub issue for anything other than a bug usually doesn't lead to anyone implementing the idea. Discussing the topic in chat has slightly higher chances of success.

Another, and more important, aspect of open source projects is that there are very few active contributors, and they have very little time and their own "personal queue" of ideas/tasks.

The most straightforward way to check an idea is to prototype it yourself and share the results, but I agree that this is too complicated for people who aren't familiar with the code.

So, the other option is to try to get it into the active contributors' "personal queue". The best way to do this (with still low chances of success) is to start a conversation in the chat, with the concrete idea (spoken in chess/Lc0 terms) and maybe some intuition about why it will work.

Just posting a paper and saying "please reread the paper, there are good thoughts there" is unlikely to get anyone engaged.

I hope this helps!

mooskagh commented 1 year ago

Ok, I've reread the paper about the "three improvements", and can confirm that none of them is applicable to chess/Lc0:

  1. The "Self Supervised Consistency Loss" makes the state computation (given previous state and action) be learnable faster. Given that we have a perfect game model (movegen) and no "state" representation in MuZero terms, that's not applicable.
  2. "Value Prefix" (the averaging of the immediate rewared) doesn't make sense because there's no immediate reward in chess, only the final reward (win of the game).
  3. Similarly, "Off-Policy Correction" (variable immediate reward span) doesn't make sense because there's no immediate reward.
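To make point 2 concrete, here is a minimal sketch (hypothetical helper names, not lc0 code) of what an EfficientZero-style value target reduces to when every intermediate reward is zero, as in chess:

```python
def n_step_value_target(rewards, bootstrap_value, gamma=1.0):
    """Generic EfficientZero-style target: discounted sum of intermediate
    rewards (the "value prefix") plus the discounted bootstrap value."""
    target = 0.0
    for i, u in enumerate(rewards):
        target += (gamma ** i) * u
    return target + (gamma ** len(rewards)) * bootstrap_value

# In chess there is no per-move reward: every u is 0 and the only signal
# is the final game result, so the "value prefix" term vanishes entirely.
rewards_chess = [0.0] * 5        # 5 plies, no intermediate reward
v_bootstrap = 0.31               # value head output at the bootstrap position
print(n_step_value_target(rewards_chess, v_bootstrap))  # -> 0.31, just the value
```

With no intermediate reward the prefix sum is always zero, so there is nothing for a learned value-prefix head (or a variable reward span) to improve on; the target is just the value or game result that Lc0 already trains on.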

I'm closing the issue for now, feel free to reopen if you have any further comments.