Eclectic-Sheep / sheeprl

Distributed Reinforcement Learning accelerated by Lightning Fabric
https://eclecticsheep.ai
Apache License 2.0

dv3 Imagination notebook re-visited #285

Closed. anthony0727 closed this issue 1 month ago.

anthony0727 commented 1 month ago

I can't find a button to reopen the issue. Regarding #272, I have two questions!

1) Shouldn't the initial stochastic_state be flattened, s.t.

stochastic_state = player.stochastic_state.view(1, 1, -1).clone()

in

    if i == initial_steps - imagination_steps - 1:
        stochastic_state = player.stochastic_state.clone()
        recurrent_state = player.recurrent_state.clone()
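
For reference, a minimal sketch of the shape difference I mean (this assumes the default DV3 discrete latent of 32 variables x 32 classes; the tensor below is just a dummy stand-in for player.stochastic_state):

    import torch

    # dummy stand-in for player.stochastic_state with the default DV3
    # discrete latent: 32 categorical variables x 32 classes each
    stochastic_state = torch.zeros(1, 1, 32, 32)

    # flattened view used when concatenating with the recurrent state
    flat = stochastic_state.view(1, 1, -1)
    print(flat.shape)  # torch.Size([1, 1, 1024])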

2) Shouldn't 0.5 also be added to the imagined reconstructed obs, s.t.

        step_data["rgb"] = rec_obs["rgb"].unsqueeze(0).detach().cpu().numpy() + 0.5

in

 # imagination step
        stochastic_state, recurrent_state = world_model.rssm.imagination(stochastic_state, recurrent_state, actions)
        # update current state
        imagined_latent_states = torch.cat((stochastic_state.view(1, 1, -1), recurrent_state), -1)
        rec_obs = world_model.observation_model(imagined_latent_states)
        step_data["rgb"] = rec_obs["rgb"].unsqueeze(0).detach().cpu().numpy()
anthony0727 commented 1 month ago

3) Also, -0.5 is missing here. This is quite critical in the test phase: when testing with the notebook, I couldn't get a return similar to the one seen during training, and this was the main cause

                preprocessed_obs[k] = preprocessed_obs[k] / 255.0 - 0.5

in

 preprocessed_obs[k] = torch.as_tensor(v[np.newaxis], dtype=torch.float32, device=fabric.device)
            if k in cfg.algo.cnn_keys.encoder:
                preprocessed_obs[k] = preprocessed_obs[k] / 255.0 
        mask = {k: v for k, v in preprocessed_obs.items() if k.startswith("mask")}
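
For comparison, a self-contained sketch of the preprocessing that would match training (the obs keys, cnn_keys, and the [-0.5, 0.5] target range mirror the snippets above; the helper itself is illustrative, not sheeprl code):

    import numpy as np
    import torch

    def preprocess_obs(obs: dict, cnn_keys, device="cpu") -> dict:
        preprocessed = {}
        for k, v in obs.items():
            t = torch.as_tensor(v[np.newaxis], dtype=torch.float32, device=device)
            if k in cnn_keys:
                # same normalization used at training time: uint8 -> [-0.5, 0.5]
                t = t / 255.0 - 0.5
            preprocessed[k] = t
        return preprocessed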

4) Shouldn't the fixes below be made to align the buffers' timeline? I'm a bit confused about this, though. I think -1 for actions is correct:

            # actions actually played by the agent
            actions = torch.tensor(
                rb_initial["actions"][-imagination_steps + i - 1], # (anthony) subtract 1
                device=fabric.device,
                dtype=torch.float32,
            )[None]

I think the original -imagination_steps + i is correct together with -1 for actions, but instead I got the desired trajectory with -imagination_steps + i + 1... confusing

        # reconstruct the observations from the latent states used when interacting with the environment
        played_latent_states = torch.cat(
            (
                torch.tensor(rb_initial["stochastic_state"][-imagination_steps + i + 1], device=fabric.device), # (anthony) add 1
                torch.tensor(rb_initial["recurrent_state"][-imagination_steps + i + 1], device=fabric.device), # (anthony) add 1
            ),
            -1,
        )
anthony0727 commented 1 month ago

With the changes above, I now get the desired performance & results!

michele-milesi commented 1 month ago

Hi @anthony0727,

  1. The player's stochastic state is already flattened, as you can see here: https://github.com/Eclectic-Sheep/sheeprl/blob/40035066a55b76fd9f9dc4d92ee5a749e079e6b1/sheeprl/algos/dreamer_v3/agent.py#L655 or https://github.com/Eclectic-Sheep/sheeprl/blob/40035066a55b76fd9f9dc4d92ee5a749e079e6b1/sheeprl/algos/dreamer_v3/agent.py#L686
  2. You are right, the reconstructed observation should be incremented by 0.5
  3. Same as 2.: we modified Dreamer and these two modifications slipped through.
  4. +1 is correct because there is a mismatch between actions and latent states (due to the inclusion of the initial latent state in the buffer). The problem with +1 is that the last action is the one in position 0.

I will fix these problems as soon as possible, thanks for reporting them.

EDIT:

  1. As it stands, latent states at index i are generated by the observation at index i-1 and are used to generate the action at index i-1. This is because the initial latent state is buffered. I will modify it because it only creates confusion.
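
As a toy illustration of that alignment (plain Python placeholders, not the real buffer contents):

    # toy buffer with 4 steps; strings stand in for the real tensors
    buffer = {
        # the initial latent state is stored too, so latents are shifted by one:
        # latent[i] was produced from obs[i-1] and used to pick action[i-1]
        "obs":              ["o0", "o1", "o2", "o3"],
        "actions":          ["a0", "a1", "a2", "a3"],
        "stochastic_state": ["s_init", "s(o0)", "s(o1)", "s(o2)"],
    }

    i = 2
    # the latent aligned with obs[i] / action[i] sits one slot later
    assert buffer["stochastic_state"][i + 1] == f"s(o{i})"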
belerico commented 1 month ago

@anthony0727 @michele-milesi thank you both! Indeed, we save the stochastic and recurrent states that refer to the previous action instead of the ones that have just been generated.

michele-milesi commented 1 month ago

Hi @anthony0727, we created a branch to fix this issue, can you check if it works? (https://github.com/Eclectic-Sheep/sheeprl/tree/fix/dv3-imagination-notebook)

Thanks

anthony0727 commented 1 month ago

I'll look into it & the new dv3 soon, after my deadline thing!!