**Describe the bug**
This code block in `exp_manager.check_resume` seems to attempt to archive all log files from the previous run into a folder `run_0`, even when training from scratch. However, due to different timing during initialization, each task can produce `log_dir / f'nemo_log_globalrank-{global_rank}_localrank-{local_rank}.txt'` either before or after this code block runs. As a result, some log files are moved into `run_0` even though they were produced by the current run. Note that [seconds_to_sleep](https://github.com/NVIDIA/NeMo/blob/72f630d087d45655b1a069dc72debf01dfdbdb2d/nemo/utils/exp_manager.py#L612) does not help here, because `log_dir / f'nemo_log_globalrank-{global_rank}_localrank-{local_rank}.txt'` has already been generated by the non-zero-rank worker.
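To make the timing issue concrete, here is a minimal standalone sketch (plain Python, not NeMo's actual code; the archiving step is heavily simplified) of how a log file that belongs to the current run can get swept into `run_0`:

```python
# Standalone sketch of the race described above: if a non-zero rank creates its
# current-run log file before rank 0 runs the archiving step, that file is moved
# into run_0 along with genuinely old files.
import shutil
import tempfile
from pathlib import Path

log_dir = Path(tempfile.mkdtemp())

# A non-zero rank (here rank 2) initializes its logger first and creates its
# log file for the *current* run.
current_run_log = log_dir / "nemo_log_globalrank-2_localrank-2.txt"
current_run_log.touch()

# Rank 0 then reaches the archiving block: it treats every existing log file in
# log_dir as a leftover from a previous run and moves it into run_0.
run_dir = log_dir / "run_0"
run_dir.mkdir()
for f in log_dir.glob("*.txt"):
    shutil.move(str(f), str(run_dir / f.name))

# The current-run log file now lives under run_0 even though it was produced
# by this run.
print(sorted(p.relative_to(log_dir).as_posix() for p in log_dir.rglob("*.txt")))
# -> ['run_0/nemo_log_globalrank-2_localrank-2.txt']
```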
**Steps/Code to reproduce bug**

Run `megatron_gpt_pretraining.py` for any model from scratch with multiple tasks. In my example I used one DGX H100 through Slurm, with 8 tasks in total (1 GPU per task). After the run, the `explicit_log_dir` shows that for ranks 2, 4, 5, and 6, the log files for the current run were incorrectly archived into `run_0`.
**Expected behavior**

When training from scratch, the `run_0` folder should not be created at all, and everything should be logged to `log_dir` directly. Even when previous log files are archived into a `run_<count>` folder, the log files for the current run should still be logged to `log_dir` directly.
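One hypothetical way to express this expectation in code (not a proposed patch; `archive_previous_logs`, `job_start_time`, and the glob pattern are illustrative assumptions, not NeMo's API) would be to only archive files that predate the current job, and to skip creating the `run_<count>` folder when there is nothing old to move:

```python
# Illustration only: archive a log file into run_<count> only if it predates the
# current job, so files already created by other ranks for this run stay in log_dir.
import shutil
import time
from pathlib import Path

def archive_previous_logs(log_dir: Path, run_count: int, job_start_time: float) -> None:
    old_logs = [
        f
        for f in log_dir.glob("nemo_log_globalrank-*_localrank-*.txt")
        if f.stat().st_mtime < job_start_time  # skip files created by the current run
    ]
    if not old_logs:  # training from scratch: do not create run_<count> at all
        return
    run_dir = log_dir / f"run_{run_count}"
    run_dir.mkdir(exist_ok=True)
    for f in old_logs:
        shutil.move(str(f), str(run_dir / f.name))

# Usage sketch: record the start time before any rank creates its log file,
# then pass it to the archiving step that rank 0 runs.
job_start = time.time()
# archive_previous_logs(Path("/path/to/explicit_log_dir"), run_count=0, job_start_time=job_start)
```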
**Environment overview (please complete the following information)**

Not relevant to this bug.

**Environment details**

Not relevant to this bug.
**Additional context**

None