exp_manager::check_resume archive log files to run_0 folder incorrectly when train from scratch

LiGeNvidia commented 3 months ago

Describe the bug

This code block in `exp_manager::check_resume` appears to archive all log files from the previous run into a `run_<count>` folder, even when training from scratch. However, because initialization timing differs between tasks, each task may create `log_dir / f'nemo_log_globalrank-{global_rank}_localrank-{local_rank}.txt'` either before or after this code block runs. As a result, some log files are moved to `run_0` even though they were produced by the current run. Note that [seconds_to_sleep](https://github.com/NVIDIA/NeMo/blob/72f630d087d45655b1a069dc72debf01dfdbdb2d/nemo/utils/exp_manager.py#L612) does not help here, because `log_dir / f'nemo_log_globalrank-{global_rank}_localrank-{local_rank}.txt'` has already been generated for the non-zero-rank worker by that point.
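The ordering problem described above can be illustrated without NeMo. The following is a minimal, deterministic sketch, not the actual `exp_manager` code: `archive_existing_logs`, `open_rank_log`, and `simulate_fresh_run` are hypothetical helpers, and the assumption is simply that the archiving step moves every `nemo_log_*.txt` file it finds at the moment it runs.

```python
import tempfile
from pathlib import Path


def archive_existing_logs(log_dir: Path) -> Path:
    """Hypothetical stand-in for the archiving step: move every
    nemo_log_*.txt currently in log_dir into a new run_<count> folder."""
    run_count = sum(1 for p in log_dir.glob("run_*") if p.is_dir())
    run_dir = log_dir / f"run_{run_count}"
    run_dir.mkdir()
    for log_file in sorted(log_dir.glob("nemo_log_*.txt")):
        log_file.rename(run_dir / log_file.name)
    return run_dir


def open_rank_log(log_dir: Path, rank: int) -> Path:
    """Create the per-rank log file, as each task does during initialization."""
    path = log_dir / f"nemo_log_globalrank-{rank}_localrank-{rank}.txt"
    path.write_text(f"rank {rank} started\n")
    return path


def simulate_fresh_run(early_ranks, late_ranks):
    """Ranks in early_ranks create their log before archiving runs,
    ranks in late_ranks create it afterwards. No previous run exists."""
    with tempfile.TemporaryDirectory() as tmp:
        log_dir = Path(tmp)
        for rank in early_ranks:
            open_rank_log(log_dir, rank)
        run_dir = archive_existing_logs(log_dir)
        for rank in late_ranks:
            open_rank_log(log_dir, rank)
        archived = sorted(p.name for p in run_dir.iterdir())
        kept = sorted(p.name for p in log_dir.glob("nemo_log_*.txt"))
        return archived, kept, run_dir.name


if __name__ == "__main__":
    archived, kept, run_name = simulate_fresh_run([2, 4, 5, 6], [0, 1, 3, 7])
    print(run_name, archived)
    print("log_dir", kept)
```

With ranks 2, 4, 5, and 6 initialized early, the simulation ends with those four log files inside `run_0` and the other four in `log_dir`, even though no previous run ever existed.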

Steps/Code to reproduce bug

Run megatron_gpt_pretraining.py for any model from scratch with multiple tasks. In my example, I used a single DGX H100 node through Slurm, with 8 tasks in total (one GPU per task). The explicit_log_dir ends up like the following:

├── cmd-args.log
├── events.out.tfevents.1721866799.##########.2542397.0
├── git-info.log
├── hparams.yaml
├── lightning_logs.txt
├── nemo_error_log.txt
├── nemo_log_globalrank-0_localrank-0.txt
├── nemo_log_globalrank-1_localrank-1.txt
├── nemo_log_globalrank-3_localrank-3.txt
├── nemo_log_globalrank-7_localrank-7.txt
└── run_0
    ├── nemo_log_globalrank-2_localrank-2.txt
    ├── nemo_log_globalrank-4_localrank-4.txt
    ├── nemo_log_globalrank-5_localrank-5.txt
    └── nemo_log_globalrank-6_localrank-6.txt

1 directory, 14 files

Notice that for ranks 2, 4, 5, and 6, the log files for the current run are incorrectly archived in run_0.

Expected behavior

When training from scratch, the run_0 folder should not be created; everything should be logged to log_dir directly.
When resuming training, only log files from the previous run should be archived to the run_<count> folder; log files for the current run should still be written to log_dir directly.
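One possible way to get the behavior described above is sketched below. It is an illustration under stated assumptions, not a proposed patch to NeMo: `archive_previous_logs`, the `resuming` flag, and the `run_start` timestamp are all hypothetical, and the sketch assumes that files last modified before `run_start` belong to the previous run.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Optional


def archive_previous_logs(log_dir: Path, resuming: bool, run_start: float) -> Optional[Path]:
    """Hypothetical helper: archive only logs that predate the current run.

    resuming  -- assumed flag, True only when a previous run is being resumed
    run_start -- assumed timestamp (seconds since epoch) taken before any
                 task of the current run opens its log file
    """
    if not resuming:
        # Training from scratch: never create a run_<count> folder.
        return None
    stale = [p for p in sorted(log_dir.glob("nemo_log_*.txt"))
             if p.stat().st_mtime < run_start]
    if not stale:
        return None
    run_count = sum(1 for p in log_dir.glob("run_*") if p.is_dir())
    run_dir = log_dir / f"run_{run_count}"
    run_dir.mkdir()
    for log_file in stale:
        log_file.rename(run_dir / log_file.name)
    return run_dir


def demo():
    """One log from a 'previous run' and one from the 'current run'."""
    with tempfile.TemporaryDirectory() as tmp:
        log_dir = Path(tmp)
        run_start = time.time()
        old = log_dir / "nemo_log_globalrank-0_localrank-0.txt"
        new = log_dir / "nemo_log_globalrank-1_localrank-1.txt"
        old.write_text("previous run\n")
        new.write_text("current run\n")
        # Force modification times to either side of run_start.
        os.utime(old, (run_start - 100, run_start - 100))
        os.utime(new, (run_start + 1, run_start + 1))
        fresh = archive_previous_logs(log_dir, resuming=False, run_start=run_start)
        resumed = archive_previous_logs(log_dir, resuming=True, run_start=run_start)
        archived = sorted(p.name for p in resumed.iterdir())
        kept = sorted(p.name for p in log_dir.glob("nemo_log_*.txt"))
        return fresh, resumed.name, archived, kept


if __name__ == "__main__":
    print(demo())
```

The timestamp filter has known limits: if a task of the current run appends to a log file left over from the previous run before archiving happens, that file's modification time is refreshed and it would not be archived, and on multi-node jobs the comparison depends on reasonably synchronized clocks.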

Environment overview (please complete the following information) Not relevant to this bug.

Environment details Not relevant to this bug.

Additional context None

github-actions[bot] commented 2 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 1 month ago

This issue was closed because it has been inactive for 7 days since being marked as stale.

NVIDIA / NeMo

exp_manager::check_resume archive log files to run_0 folder incorrectly when train from scratch #9952