UT-Austin-RPL / amago

a simple and scalable agent for training adaptive policies with sequence-based RL
https://ut-austin-rpl.github.io/amago/
MIT License

Questions regarding Crafter experiment #7

Closed symoon11 closed 9 months ago

symoon11 commented 11 months ago

Thank you for your interesting work on stabilizing transformers in reinforcement learning. I have some questions regarding the Crafter experiment.

  1. Based on my calculations, AMAGO trains an agent for a total of 40 million environment steps (8 agents × 2,000 steps × 2,500 epochs). Could you please confirm whether my understanding is correct?
  2. Regarding Table 1 in Section C.5.2 of your paper, I would like to clarify whether the success rates are evaluated on test episodes collected after training or on all training episodes, as in the original Crafter paper.
  3. I set both the context length and the maximum episode length to 2000. Since most episodes span at most about 500 steps, this setting implies that the transformer might attend to previous episodes. However, I could not find any part of the code that limits the attention mask to the current episode. Is this an intended design choice?

Once again, thank you for your significant contributions.

jakegrigsby commented 11 months ago

Hi, thanks!

Q1: Yes, that's correct. It's worth noting that the sample-efficiency numbers in some of these experiments are largely untuned; there are far too many hyperparameters to grid-search across this many environments. The paper is more concerned with creating fair ablations and a method that learns stably at any reasonable setting.
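
As a quick sanity check of that arithmetic (the variable names below are just for illustration, not actual AMAGO configuration keys):

```python
# Back-of-the-envelope check of the Crafter training budget discussed above.
# These names are illustrative, not actual AMAGO config keys.
parallel_actors = 8      # agents collecting data in parallel
steps_per_epoch = 2_000  # environment steps per actor per epoch
epochs = 2_500

total_env_steps = parallel_actors * steps_per_epoch * epochs
print(f"{total_env_steps:,}")  # 40,000,000
```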

Q2: They are test episodes. The success rates are computed by loading the best checkpoint and then iterating through a list of single-goal tasks (CrafterEnv.set_fixed_task) instead of randomly generating tasks as is done during training. Each task was evaluated across the parallel actors for many episodes (20k timesteps per actor, if I remember correctly).
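
Roughly, that evaluation loop looks something like the sketch below. Only CrafterEnv.set_fixed_task is an actual identifier from this thread; the policy interface, rollout helper, task names, and episode attributes are hypothetical placeholders:

```python
# Illustrative sketch of the single-goal evaluation described above. Only
# CrafterEnv.set_fixed_task is mentioned in this thread; everything else here
# is a hypothetical placeholder.
def evaluate_single_goal_tasks(policy, parallel_envs, tasks, rollout_fn,
                               timesteps_per_actor=20_000):
    """Evaluate a loaded checkpoint on a list of fixed single-goal Crafter tasks."""
    success_rates = {}
    for task in tasks:
        # pin every actor to one fixed goal instead of sampling tasks randomly
        for env in parallel_envs:
            env.set_fixed_task(task)
        # collect episodes for a fixed timestep budget per actor; rollout_fn is
        # assumed to return finished episodes with a boolean `success` field
        episodes = [
            ep
            for env in parallel_envs
            for ep in rollout_fn(policy, env, timesteps_per_actor)
        ]
        success_rates[task] = sum(ep.success for ep in episodes) / max(len(episodes), 1)
    return success_rates
```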

Q3: In the code, the context length acts as an upper bound on the sequence length (max_seq_len). Training sequences are padded to the length of the longest one in the batch. So in Crafter, a sequence only starts dropping its oldest timesteps after 2k steps, but most episodes never reach that limit. Most other environments have a fixed time limit, so max_seq_len equals the context length. (Edit: to clarify, the Transformer never attends to rollouts from previous episodes, because learning uses variable sequence lengths.)
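
To illustrate the padding behavior, here is a minimal standalone PyTorch sketch (not the actual AMAGO data pipeline): each training sequence starts at its own episode reset, so padding to the longest sequence in the batch never mixes in timesteps from an earlier episode.

```python
import torch

# Minimal illustration of the variable-length batching described above;
# a standalone sketch, not the actual AMAGO implementation.
def pad_batch(sequences, max_seq_len=2_000):
    """Pad per-episode sequences to the longest one in the batch (capped at max_seq_len)."""
    # drop the oldest timesteps of any sequence longer than the context limit
    sequences = [seq[-max_seq_len:] for seq in sequences]
    longest = max(seq.shape[0] for seq in sequences)
    padded = torch.zeros(len(sequences), longest, sequences[0].shape[-1])
    mask = torch.zeros(len(sequences), longest, dtype=torch.bool)
    for i, seq in enumerate(sequences):
        padded[i, : seq.shape[0]] = seq
        mask[i, : seq.shape[0]] = True  # attention covers only real timesteps
    return padded, mask

# example: three Crafter episodes of different lengths in one training batch
batch = [torch.randn(n, 8) for n in (312, 487, 95)]
padded, mask = pad_batch(batch)
print(padded.shape, mask.sum(dim=1))  # torch.Size([3, 487, 8]) tensor([312, 487,  95])
```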

If you are specifically working on Crafter, you might want to wait a bit before getting started. I've made changes to this open-source version of the codebase to make inference run faster and the training scripts easier to use, but I still need to bring back some features for Crafter: observations need to become multi-modal dicts again, and the Embedding TstepEncoder is missing. I'll reply to this thread again to let you know when they are back.

jakegrigsby commented 10 months ago

#9 now replicates all the main Crafter results, including the evaluation process. This was verified by retraining a seed of the more expensive pixel-based version used in the Appendix. There is now a demonstration Jupyter notebook (examples/crafter_pixels_demo.ipynb) to visualize gameplay from a checkpoint of this run. It reaches a 55% success rate on the full task distribution and matches the single-goal numbers in Table 2.