Hi, congrats on your excellent work. I tried to train your model from scratch, but I have run into the following error, which has been bothering me for several days:
Use Cosine LR scheduler
Set warmup steps = 1402520
Set warmup steps = 0
Max WD = 0.0500000, Min WD = 0.0500000
criterion = SoftTargetCrossEntropy()
Auto resume checkpoint:
Start training for 300 epochs
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 2594047) of binary: /home/shiweil/miniconda3/envs/torch110_py37/bin/python
Traceback (most recent call last):
File "/home/shiweil/miniconda3/envs/torch110_py37/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/shiweil/miniconda3/envs/torch110_py37/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/shiweil/miniconda3/envs/torch110_py37/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/shiweil/miniconda3/envs/torch110_py37/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/shiweil/miniconda3/envs/torch110_py37/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/shiweil/miniconda3/envs/torch110_py37/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
)(*cmd_args)
File "/home/shiweil/miniconda3/envs/torch110_py37/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/shiweil/miniconda3/envs/torch110_py37/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
main.py FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-04-08_20:12:22
host : gcn21.local.snellius.surf.nl
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 2594047)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 2594047
========================================================
My command is "CUDA_LAUNCH_BLOCKING=1 python -m torch.distributed.launch --nproc_per_node=1 main.py --model convnext_tiny --drop_path 0.1 --batch_size 128 --lr 4e-3 --update_freq 1 --model_ema true --model_ema_eval true --data_path /projects/2/managed_datasets/imagenet10k/".
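For what it's worth, exitcode -9 means the launcher's child process received SIGKILL from the operating system rather than crashing inside Python, and on Linux that is often the kernel OOM killer reclaiming host RAM (a CUDA out-of-memory would instead surface as a Python RuntimeError). To check whether host memory is the problem, I have been logging it with a small helper like the sketch below; psutil and the helper itself are my own additions, not part of your repo:

```python
# Minimal sketch to watch host RAM while training runs, to see whether
# the job approaches the machine's memory limit before it gets SIGKILLed.
# Assumes psutil is installed (pip install psutil); it is not a repo dependency.
import threading
import time

import psutil

def log_host_memory(interval_s: float = 30.0) -> None:
    """Print host memory usage every interval_s seconds."""
    while True:
        vm = psutil.virtual_memory()
        print(
            f"host RAM: {vm.used / 2**30:.1f} GiB used / "
            f"{vm.total / 2**30:.1f} GiB total ({vm.percent:.0f}%)",
            flush=True,
        )
        time.sleep(interval_s)

# Run in the background so it does not block the training loop.
threading.Thread(target=log_host_memory, daemon=True).start()
```

If the OOM killer is responsible, the node's kernel log (dmesg) should also contain a matching "Killed process" entry around the failure time.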
Do you know what's going on here? Any help would be highly appreciated!
Best, Shiwei Liu