Eclectic-Sheep / sheeprl

Distributed Reinforcement Learning accelerated by Lightning Fabric
https://eclecticsheep.ai
Apache License 2.0

DreamerV3: (resume) training entirely without environment interaction? #300

Closed defrag-bambino closed 2 weeks ago

defrag-bambino commented 2 weeks ago

Hi,

I have the following use case: I am running an experiment where I initially train a DreamerV3 agent normally, but afterwards (or after a certain amount of time) want to stop interacting with the real environment entirely. In theory, that would mean setting the replay_ratio to infinity.

I have tried using the `checkpoint.resume_from` feature and modifying the replay_ratio to a larger value (e.g. 10 or 100). However, I don't think this achieves what I want: at the start of the resumed run the replay_ratio starts out around zero and only converges towards the desired value after a while. Would using `learning_starts` (as explained in https://github.com/Eclectic-Sheep/sheeprl/issues/273) help?
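Concretely, I am resuming with something like the following (the exact override names, `algo.replay_ratio` in particular, are my assumption from the configs I'm using; the checkpoint path is a placeholder):

```bash
# Resume a pre-trained DreamerV3 run and raise the replay ratio.
# checkpoint.resume_from is the feature mentioned above; algo.replay_ratio
# is assumed to be the Hydra override for the replay ratio.
python sheeprl.py exp=dreamer_v3 \
  checkpoint.resume_from=/path/to/my/checkpoint.ckpt \
  algo.replay_ratio=100
```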

In summary, I would like to take a pre-trained DreamerV3 checkpoint and continue training it only through dreaming.

Thanks!

P.S.: As a side question, what about the exact opposite? If I wanted to train ONLY through real environment interaction, could I just set replay_ratio=0?

belerico commented 2 weeks ago

Hi @defrag-bambino, the first thing you ask can be done: you can find a branch here where you can modify the replay ratio after resuming without running into OOM issues (caused by sampling too many trajectories from the buffer). Since this is a patch, logging will not work as expected: it would have to be modified to depend on the training steps instead of the policy steps.
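Roughly like this (the branch itself is linked above, so its name is only a placeholder here):

```bash
# Check out the patch branch, then resume as before with a higher replay ratio.
# <patch-branch> stands for the branch linked above; its name is not
# reproduced in this thread.
git clone https://github.com/Eclectic-Sheep/sheeprl.git
cd sheeprl
git checkout <patch-branch>
pip install -e .
```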

I don't think that I have fully understood your second question: if you set the replay ratio to 0 you will never train at all... Maybe you could set the imagination horizon to 1 instead of 15 (`cfg.algo.horizon=1`) and train the actor with (almost) no imagination (never tried, so expect some errors...)
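An untested sketch of that override (same caveat as above on the exact key names):

```bash
# Untested: shrink the imagination horizon from the default 15 to a single
# step, so behavior learning uses (almost) no dreamed rollout.
python sheeprl.py exp=dreamer_v3 \
  checkpoint.resume_from=/path/to/my/checkpoint.ckpt \
  algo.horizon=1
```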

defrag-bambino commented 2 weeks ago

Thanks!

Ok, the second question was more of a shower thought. Essentially, I was asking whether it is possible to remove the "dreaming" entirely, i.e. remove the world model that generates artificial training data and keep only the behavior learning. Of course this entirely defeats the purpose of Dreamer, but I was just curious what would happen if you ran it as a regular off-policy algorithm.

You can close the issue if you like.