Using 6 dataloader workers
Logging results to gun/yolov73
Starting training for 300 epochs...
Epoch gpu_mem box obj cls total labels img_size
  0%|          | 0/694 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/yao/project/detection/yolov7/train.py", line 609, in <module>
    train(hyp, opt, device, tb_writer)
  File "/home/yao/project/detection/yolov7/train.py", line 369, in train
    scaler.scale(loss).backward()
  File "/home/yao/miniconda3/lib/python3.9/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/yao/miniconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 30240 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 30241 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 30242 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 30244 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 30245 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 30246 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 30247 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 3 (pid: 30243) of binary: /home/yao/miniconda3/bin/python3
Traceback (most recent call last):
  File "/home/yao/miniconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/yao/miniconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/yao/miniconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/yao/miniconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/yao/miniconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/yao/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/yao/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/yao/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-07-14_11:43:09
  host      : ubunt
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 30243)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
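A `CUDNN_STATUS_NOT_INITIALIZED` raised inside `backward()` during a DDP run is usually not a cuDNN bug in itself; common causes are the GPU running out of memory when cuDNN allocates its workspace, or a mismatched CUDA driver / cuDNN build for the installed PyTorch. Below is a minimal diagnostic sketch (a hypothetical `cudnn_check.py`, not part of the YOLOv7 repo) that forces cuDNN to initialize on every visible GPU outside the launcher, so the failing device and its free memory are reported directly instead of through an elastic `ChildFailedError`:

```python
# cudnn_check.py -- hypothetical standalone diagnostic, not part of YOLOv7.
# Runs a tiny conv forward/backward on each visible GPU so cuDNN must
# initialize, then reports free memory or the raised error per device.
import torch

def check_device(idx: int) -> None:
    dev = torch.device(f"cuda:{idx}")
    x = torch.randn(2, 3, 64, 64, device=dev, requires_grad=True)
    conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).to(dev)
    conv(x).sum().backward()      # backward() is where the training run failed
    torch.cuda.synchronize(dev)

if __name__ == "__main__":
    print("torch:", torch.__version__,
          "cuda:", torch.version.cuda,
          "cudnn:", torch.backends.cudnn.version())
    for i in range(torch.cuda.device_count()):
        try:
            check_device(i)
            free, total = torch.cuda.mem_get_info(i)
            print(f"cuda:{i} OK  ({free / 1e9:.1f} GB free of {total / 1e9:.1f} GB)")
        except RuntimeError as err:
            print(f"cuda:{i} FAILED: {err}")
```

If every device passes this check in isolation, the error during training is more likely memory pressure from the multi-GPU run itself, and reducing `--batch-size`, `--img-size`, or the number of dataloader workers is worth trying before reinstalling CUDA/cuDNN.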