Currently, when running with `checkpoint.resume_from`, it will take the config settings from the previous run, correct? Is it also possible to overwrite some of the settings for the new run, e.g. `total_steps`?

Thanks
Hi @defrag-bambino!
Yes, the configs are taken from the previous run.
As specified here, only a few settings can be overridden for the new run, namely `learning_starts`, `root_dir`, and `run_name`. Everything else, even if you modify it from the CLI, will not be overridden.
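For context, here is a minimal sketch of that behaviour. This is illustrative only, not sheeprl's actual `cli.py`, and the config keys are assumptions made for the example: because the old run's config is merged back over the freshly built CLI config, any key that is not explicitly dropped from the old config keeps its old value, no matter what you pass on the command line.

```python
from omegaconf import OmegaConf

# Illustrative only: on resume, the old run's config is merged *over* the config built
# from the CLI, so anything not explicitly dropped from the old config wins.
cli_cfg = OmegaConf.create({"run_name": "resumed", "algo": {"per_rank_batch_size": 32}})
old_cfg = OmegaConf.create({"algo": {"per_rank_batch_size": 16}})  # run_name already dropped (overridable)

cfg = OmegaConf.merge(cli_cfg, old_cfg)
print(cfg.run_name)                  # "resumed" -> run_name is one of the overridable keys
print(cfg.algo.per_rank_batch_size)  # 16        -> the CLI override is silently ignored
```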
I see. Why is this the case? Of course some settings should or could not be changed for a second run, but for the majority this should not cause any issues, should it?
At first, it was just a matter of simplifying the implementation. I think we can let some hyperparameters be changed when resuming from a checkpoint. The safest are:

- `total_steps`
- `learning_starts` (already available)
- `prefill_steps`

The less safe ones are those related to batch sizes, gradient steps, and Fabric's settings: the number of parallel processes influences the learning dynamics, as do the batch sizes and gradient steps. To modify those we need to make sure that everything runs the same even when these parameters are changed.
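To make that concern concrete, a back-of-the-envelope sketch (the function names and the update rule here are my assumptions, not sheeprl's exact semantics) of how those settings change what the optimizer sees:

```python
def effective_batch_size(per_rank_batch_size: int, num_processes: int) -> int:
    # Each parallel process contributes its own batch to every optimizer step.
    return per_rank_batch_size * num_processes

def total_gradient_updates(total_steps: int, learning_starts: int, gradient_steps: int) -> int:
    # Roughly: `gradient_steps` updates per environment step once the warm-up is over.
    return max(total_steps - learning_starts, 0) * gradient_steps

# Original run: 2 processes, batch 16, 1 gradient step -> 32 samples per update, 99_000 updates.
print(effective_batch_size(16, 2), total_gradient_updates(100_000, 1_000, 1))
# Resumed with 4 processes and 2 gradient steps -> 64 samples per update and twice the updates,
# so the optimizer schedule no longer matches the original run.
print(effective_batch_size(16, 4), total_gradient_updates(100_000, 1_000, 2))
```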
I was about to open another issue before spotting this, but you can't currently change `total_steps` when using `resume_from`.

Currently `cli.py` doesn't support modification of `total_steps`, as it just retains the old config's `total_steps`. I would think extending an experiment is a common use case, so I would suggest modifying `cli.py` to also include `old_cfg.algo.pop("total_steps", None)`.
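Something along these lines, as far as I understand the resume logic (a sketch under assumptions, not the actual `cli.py` source; the toy config only exists to make the snippet self-contained):

```python
from omegaconf import OmegaConf

# Toy stand-in for the config restored from the previous run.
old_cfg = OmegaConf.create(
    {"root_dir": "logs", "run_name": "old", "algo": {"learning_starts": 1_000, "total_steps": 100_000}}
)

# Existing whitelist: drop these keys from the old config so the CLI values for the new run win.
old_cfg.pop("root_dir", None)
old_cfg.pop("run_name", None)
old_cfg.algo.pop("learning_starts", None)

# Suggested addition, so a resumed experiment can be extended past its original length:
old_cfg.algo.pop("total_steps", None)
```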
Yeah, you're right: that's the only modification needed. I'll open a PR ASAP.
I've created a branch here where you can also modify `total_steps`.