Closed: symoon11 closed this issue 9 months ago
Hi, thanks!
Q1: Yes, that's correct. It's worth noting that the sample-efficiency numbers in some of these experiments are largely untuned. There are far too many hyperparameters to grid-search across this many environments. The paper is more concerned with creating fair ablations and a method that learns stably at any reasonable setting.
Q2: They are test episodes. The success rates are computed by loading the best checkpoint and then iterating through a list of single-goal tasks (`CrafterEnv.set_fixed_task`) instead of randomly generating them as is done during training. Each task was evaluated across the parallel actors for many episodes (20k timesteps per actor, if I remember correctly).
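A rough sketch of that evaluation loop might look like the following. `CrafterEnv.set_fixed_task` is the real method mentioned above; the toy environment class, its `run_episode` method, and the success probabilities are illustrative stand-ins, not the actual codebase.

```python
import random


class ToyCrafterEnv:
    """Stand-in for CrafterEnv: supports fixing a single-goal task."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.fixed_task = None

    def set_fixed_task(self, task):
        # Mirrors CrafterEnv.set_fixed_task: evaluation pins one goal
        # instead of the randomly generated tasks used during training.
        self.fixed_task = task

    def run_episode(self):
        # Toy success model: pretend some tasks are easier than others.
        difficulty = {"collect_wood": 0.9, "make_pickaxe": 0.4}
        return self.rng.random() < difficulty[self.fixed_task]


def evaluate_fixed_tasks(env, tasks, episodes_per_task=100):
    """Iterate through single-goal tasks, averaging success over episodes."""
    rates = {}
    for task in tasks:
        env.set_fixed_task(task)
        successes = sum(env.run_episode() for _ in range(episodes_per_task))
        rates[task] = successes / episodes_per_task
    return rates


rates = evaluate_fixed_tasks(ToyCrafterEnv(), ["collect_wood", "make_pickaxe"])
```

In the real setup the episodes would also be sharded over the parallel actors, but the per-task averaging is the same idea.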
Q3: In the code, the context length becomes an upper bound on the sequence length (`max_seq_len`). Training sequences are padded to the length of the longest one in the batch. So in Crafter, a sequence only starts dropping its oldest timestep after 2k steps, but most episodes never reach that limit. Most other environments have a fixed time limit, so `max_seq_len` equals the context length. (Edit: to clarify, the Transformer never attends to rollouts from previous environments, because learning uses variable sequence lengths.)
If you are specifically working on Crafter, you might want to wait a bit before getting started. I made changes to this open-source version of the codebase to make inference run faster and the training scripts easier to use. I still need to bring back some features for Crafter: observations need to become multi-modal dicts again, and I am still missing the Embedding `TstepEncoder`. I'll reply to this again to let you know when they are back.
I added a demo notebook (`examples/crafter_pixels_demo.ipynb`) to visualize gameplay of a checkpoint from this run. It has a 55% success rate on the full task distribution and matches the single-goal numbers in Table 2.
Thank you for your interesting work on stabilizing transformers in reinforcement learning. I have some questions regarding the Crafter experiment.
Once again, thank you for your significant contributions.