Currently, when running with `checkpoint.resume_from`, it will take the config settings from the previous run, correct? Is it also possible to overwrite some of the settings for the new run, e.g. `total_steps`?

Thanks
Hi @defrag-bambino!
Yes, the configs are taken from the previous run.
As specified here, only a few settings can be overridden for the new run, namely `learning_starts`, `root_dir`, and `run_name`. Everything else, even if you modify it from the CLI, will not be overridden.
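For context, here is a minimal sketch of that behaviour. This is illustrative only, not sheeprl's actual `cli.py`, and the config keys are assumptions made for the example: because the old run's config is merged back over the freshly built CLI config, any key that is not explicitly dropped from the old config keeps its old value, no matter what you pass on the command line.

```python
from omegaconf import OmegaConf

# Illustrative only: on resume, the old run's config is merged *over* the config built
# from the CLI, so anything not explicitly dropped from the old config wins.
cli_cfg = OmegaConf.create({"run_name": "resumed", "algo": {"per_rank_batch_size": 32}})
old_cfg = OmegaConf.create({"algo": {"per_rank_batch_size": 16}})  # run_name already dropped (overridable)

cfg = OmegaConf.merge(cli_cfg, old_cfg)
print(cfg.run_name)                  # "resumed" -> run_name is one of the overridable keys
print(cfg.algo.per_rank_batch_size)  # 16        -> the CLI override is silently ignored
```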
I see. Why is this the case? Of course some settings should or could not be changed for a second run, but for the majority this should not cause any issues, should it?
At first, it was just a matter of simplifying the implementation. I think we can let some hyperparameters be changed when resuming from a checkpoint. The safest are:

- `total_steps`
- `learning_starts` (already available)
- `prefill_steps`

The less safe ones are those related to batch sizes, gradient steps, and Fabric's settings: the number of parallel processes influences the learning dynamics, as do the batch sizes and gradient steps. To modify those we need to make sure that everything runs the same even when these parameters are changed.
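To make that concern concrete, a back-of-the-envelope sketch (the function names and the update rule here are my assumptions, not sheeprl's exact semantics) of how those settings change what the optimizer sees:

```python
def effective_batch_size(per_rank_batch_size: int, num_processes: int) -> int:
    # Each parallel process contributes its own batch to every optimizer step.
    return per_rank_batch_size * num_processes

def total_gradient_updates(total_steps: int, learning_starts: int, gradient_steps: int) -> int:
    # Roughly: `gradient_steps` updates per environment step once the warm-up is over.
    return max(total_steps - learning_starts, 0) * gradient_steps

# Original run: 2 processes, batch 16, 1 gradient step -> 32 samples per update, 99_000 updates.
print(effective_batch_size(16, 2), total_gradient_updates(100_000, 1_000, 1))
# Resumed with 4 processes and 2 gradient steps -> 64 samples per update and twice the updates,
# so the optimizer schedule no longer matches the original run.
print(effective_batch_size(16, 4), total_gradient_updates(100_000, 1_000, 2))
```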
I was about to open another issue before spotting this, but you can't currently change `total_steps` when using `resume_from`.

Currently `cli.py` doesn't support modification of `total_steps`, as it just retains the old config's `total_steps`. I would think extending an experiment is a common use case, so I would suggest modifying `cli.py` to also include `old_cfg.algo.pop("total_steps", None)`.
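Something along these lines, as far as I understand the resume logic (a sketch under assumptions, not the actual `cli.py` source; the toy config only exists to make the snippet self-contained):

```python
from omegaconf import OmegaConf

# Toy stand-in for the config restored from the previous run.
old_cfg = OmegaConf.create(
    {"root_dir": "logs", "run_name": "old", "algo": {"learning_starts": 1_000, "total_steps": 100_000}}
)

# Existing whitelist: drop these keys from the old config so the CLI values for the new run win.
old_cfg.pop("root_dir", None)
old_cfg.pop("run_name", None)
old_cfg.algo.pop("learning_starts", None)

# Suggested addition, so a resumed experiment can be extended past its original length:
old_cfg.algo.pop("total_steps", None)
```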
Yeah, you're right: that's the only modification needed. I'll open a PR ASAP.
I've created a branch here where you can also modify `total_steps`.