Armandpl / dreamerv3

DreamerV3 + gSDE, using pytorch, on a real robot

refactor and setup training on atari #10

Closed · Armandpl closed this 5 months ago

Armandpl commented 6 months ago

I am confused about how to align the advantages with the actions. Since we predict the reward and compute the lambda return and the advantage from the world model state (h_t, z_t), and since (h_t, z_t) is the result of the action a_{t-1}, I think the advantage at t should be used to push the log prob of the action responsible for it, which is a_{t-1}. Here is how I think the code should look:

    # evaluate the policy at the imagined states (h_t, z_t), dropping the last two steps
    policy = actor(
        sg(hts[:, :-2]),
        sg(zts[:, :-2]),
    )
    # log prob of the action a_t taken at (h_t, z_t)
    logpi = policy.log_prob(sg(ats[:, :-2]).squeeze(-1))
    # pair it with the advantage at t+1, i.e. offset the advantage by one
    actor_loss = -logpi * sg(advantage[:, 1:].squeeze(-1))
    actor_entropy = policy.entropy()
    actor_loss -= ACTOR_ENTROPY * actor_entropy
    actor_loss = actor_loss * sg(traj_weight[:, :-2])
    actor_loss = actor_loss.mean()

But doing this, the training collapses (wandb run): [W&B chart, 3/6/2024 3:16 PM]

However, if I use the advantage at t to push the log prob of a_t:

    # evaluate the policy at the imagined states (h_t, z_t), dropping only the last step
    policy = actor(
        sg(hts[:, :-1]),
        sg(zts[:, :-1]),
    )
    # log prob of a_t, paired with the advantage at the same step t
    logpi = policy.log_prob(sg(ats[:, :-1]).squeeze(-1))
    actor_loss = -logpi * sg(advantage.squeeze(-1))
    actor_entropy = policy.entropy()
    actor_loss -= ACTOR_ENTROPY * actor_entropy
    actor_loss = actor_loss * sg(traj_weight[:, :-1])
    actor_loss = actor_loss.mean()

It now works (wandb run): [W&B chart, 3/6/2024 3:17 PM]
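For reference, here is a minimal sketch of the shapes involved in both variants. It assumes hts/zts/ats span the full imagination horizon H and advantage has H-1 steps, which is what the slicing above implies; the batch size, horizon, and feature sizes below are made up:

    import torch

    B, H = 16, 16                         # hypothetical batch size and imagination horizon
    hts = torch.zeros(B, H, 512)          # deterministic states h_0 .. h_{H-1}
    zts = torch.zeros(B, H, 1024)         # stochastic states z_0 .. z_{H-1}
    ats = torch.zeros(B, H, 1)            # a_t is sampled from the policy at (h_t, z_t)
    advantage = torch.zeros(B, H - 1, 1)  # one advantage per lambda-return step

    # variant 1: advantage at t+1 pushes the log prob of a_t (credit the previous action)
    assert hts[:, :-2].shape[1] == advantage[:, 1:].shape[1]  # both H-2

    # variant 2: advantage at t pushes the log prob of a_t (credit the action at the same step)
    assert hts[:, :-1].shape[1] == advantage.shape[1]         # both H-1

Both slicings are shape-consistent, so the difference is purely in which action each advantage ends up crediting.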

Armandpl commented 6 months ago

Ok, so I trained on Pong overnight, for 150k steps with a training ratio of 1024. It learns some stuff, but it seems more unstable than the official results and reaches a much lower return (-10 instead of 20). Why is that?
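For context, by training ratio I mean replayed steps per environment step, as in the DreamerV3 paper. A minimal sketch of that schedule, with made-up batch shapes rather than the actual loop in this repo:

    TRAIN_RATIO = 1024             # replayed steps per environment step
    BATCH_SIZE, SEQ_LEN = 16, 64   # hypothetical replay batch shape

    # each gradient step replays BATCH_SIZE * SEQ_LEN = 1024 steps, so at a ratio of 1024
    # the loop takes one gradient step roughly every
    # BATCH_SIZE * SEQ_LEN / TRAIN_RATIO = 1 environment step
    train_every = max(1, (BATCH_SIZE * SEQ_LEN) // TRAIN_RATIO)

    def should_train(env_step: int) -> bool:
        return env_step % train_every == 0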

todo: