YeWR / EfficientZero

Open-source codebase for EfficientZero, from "Mastering Atari Games with Limited Data" at NeurIPS 2021.
GNU General Public License v3.0

Zero score on Freeway #23

Open · opened 2 years ago by emailweixu

emailweixu commented 2 years ago

I tried to run the code on Atari Freeway using the following command with the default settings:

python main.py --env FreewayNoFrameskip-v4 \
--case atari \
--opr train \
--amp_type torch_amp \
--num_gpus 1 \
--num_cpus 10 \
--cpu_actor 2 \
--gpu_actor 2 \
--force \
--object_store_memory 21474836480 \
--seed 0

I tried two seeds, 0 and 1. Based on the tensorboard curves, the algorithm seems to receive no reward at all during training: both workers.ori_reward and Train_statistics.target_value_prefix_mean are constant zero from beginning to end.

From train_test_log, seed 0 got a positive reward (~7.5) at step 0, but no reward at all after that. Seed 1 also got ~7.5 reward at step 0; among the remaining evaluations, half scored 0 and the other half scored 21.34.

I wonder whether I did something wrong.

Thanks

Wei

rPortelas commented 2 years ago

Strengthening the relevance of @emailweixu's reproducibility issue:

Here are my performance results on Freeway, 4 seeds: [plot: freeway_4seeds]

All 4 seeds ended training with a score of 0; however, 1 seed did manage to reach 21.5 reward at some points during training.

I used the provided train.sh script (so 4 GPUs), with the following modifications to fit my setup: "--object_store_memory 100000000000" and "--num_cpus 80", which should not impact performance.

This is related to issue https://github.com/YeWR/EfficientZero/issues/21, which points out another reproducibility problem; see that issue for potential reasons.

Best, Rémy

emailweixu commented 2 years ago

@rPortelas Actually, I have reasons to believe that a zero score on Freeway is expected. If you play Freeway yourself, you can see that it requires consistent exploration in one direction (UP) for many steps in order to get any reward. However, in the current implementation of EfficientZero, the behavior policy is a stochastic policy based on the MCTS result. At the beginning of training, the policy from MCTS is close to uniform given how EfficientZero is initialized (i.e., zero initialization for the last layer of the prediction nets), which makes it very hard to consistently go UP. Other algorithms such as CURL or SPR use a greedy policy (coupled with a noisy net) and are more likely to show consistent exploration behavior.
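
To make this concrete, here is a minimal sketch, not code from this repo (the hidden size is illustrative, and I'm assuming Freeway's minimal 3-action set), of why zero-initializing the policy head yields a uniform prior and why long runs of UP are then vanishingly rare:

import torch
import torch.nn as nn

num_actions = 3   # assumed minimal action set for Freeway: NOOP, UP, DOWN
hidden_dim = 256  # illustrative hidden size, not the repo's value

# Zero-initialize the final policy layer, as in EfficientZero's init_zero.
policy_head = nn.Linear(hidden_dim, num_actions)
nn.init.zeros_(policy_head.weight)
nn.init.zeros_(policy_head.bias)

# All logits are zero, so the softmax prior is exactly uniform,
# no matter what the hidden state is.
hidden = torch.randn(1, hidden_dim)
prior = torch.softmax(policy_head(hidden), dim=-1)
print(prior)  # tensor([[0.3333, 0.3333, 0.3333]])

# Sampling from a uniform policy, the chance of choosing UP for k
# consecutive steps is (1/3)**k, e.g. ~1.7e-5 for k = 10, so the agent
# essentially never crosses the road and never observes a reward.
print((1 / 3) ** 10)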

rPortelas commented 2 years ago

@emailweixu It is true that Freeway is challenging in terms of exploration; however, in both the EfficientZero paper and the original MuZero paper (check Table S1 in the appendix), non-zero performance is reported. So we should be able to reproduce it.

emailweixu commented 2 years ago

@rPortelas I know both EfficientZero and MuZero reported reasonable performance on Freeway. The original MuZero is not open-sourced, so I cannot re-run the experiments and cannot know for sure. But since it was trained on many more frames (20B), it is more likely to obtain reward through random exploration. Furthermore, the original MuZero paper doesn't describe how the weights of the models are initialized; it is possible that non-zero initialization of the last prediction layer can get some reward, since it makes the initial policy not uniformly random. In fact, I did try non-zero initialization with EfficientZero (changing init_zero from True to False): it did get some reward during training, but the final performance was still much lower than the reported number. Zero initialization, on the other hand, is explicitly described by EfficientZero in Appendix A.1.
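
For reference, a minimal sketch of what the init_zero switch changes; this is a hypothetical helper, not the repo's actual model code:

import torch.nn as nn

def make_policy_head(hidden_dim: int, num_actions: int, init_zero: bool) -> nn.Linear:
    # Hypothetical helper illustrating the init_zero toggle discussed above.
    head = nn.Linear(hidden_dim, num_actions)
    if init_zero:
        # Zero init, as described in EfficientZero's Appendix A.1:
        # the initial policy prior is exactly uniform.
        nn.init.zeros_(head.weight)
        nn.init.zeros_(head.bias)
    # With init_zero=False, PyTorch's default (Kaiming-uniform) init is
    # kept; the initial prior is then non-uniform, which can break the
    # symmetry and occasionally favor UP early in training.
    return head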

szrlee commented 2 years ago

Thanks for the discussion! Any follow-up on this so far?

emailweixu commented 2 years ago

@rPortelas did you try the "raw" version you mentioned in #21 on Freeway?