THU-MIG / RepViT

RepViT: Revisiting Mobile CNN From ViT Perspective [CVPR 2024] and RepViT-SAM: Towards Real-Time Segmenting Anything
https://arxiv.org/abs/2307.09283
Apache License 2.0
799 stars, 60 forks

[ERROR] failed (exitcode: 1) local_rank: 2 (pid: 60554) of binary #65

Open sankexin opened 5 months ago

sankexin commented 5 months ago

When training on ImageNet, the run fails with: [ERROR] failed (exitcode: 1) local_rank

Epoch: [7] [ 0/1251] eta: 1:43:57 lr: 0.001998 loss: 5.2755 (5.2755) time: 4.9860 data: 4.4130 max mem: 19285
Epoch: [7] [ 100/1251] eta: 0:15:17 lr: 0.001998 loss: 4.6559 (4.7341) time: 0.7736 data: 0.0005 max mem: 19285
Epoch: [7] [ 200/1251] eta: 0:13:31 lr: 0.001998 loss: 4.5642 (4.7046) time: 0.7292 data: 0.0004 max mem: 19285
[2024-05-28 14:04:27,954] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 60552 closing signal SIGTERM
[2024-05-28 14:04:27,954] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 60553 closing signal SIGTERM
[2024-05-28 14:04:27,955] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 60555 closing signal SIGTERM
[2024-05-28 14:04:28,584] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 60554) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

jameslahm commented 5 months ago

Thanks for your interest! Could you please provide more details of how to reproduce this issue?
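
The traceback in the report comes from the launcher process and only says that a worker exited with a non-zero code. The actual exception is raised inside the failing worker (here local_rank 2), so that worker's own output is what would help most. As a minimal sketch, assuming the training output was saved to a text file, the failing rank and pid can be pulled out of the launcher's summary line with the standard library alone. The function name `find_failed_workers` and the `sample` string are illustrative and not part of RepViT or PyTorch.

```python
import re

# Matches the launcher's summary line as it appears in the log in this issue.
# Assumption: other PyTorch versions may format this line differently.
FAILED_RE = re.compile(
    r"\[ERROR\] failed \(exitcode: (?P<exitcode>-?\d+)\) "
    r"local_rank: (?P<local_rank>\d+) \(pid: (?P<pid>\d+)\)"
)


def find_failed_workers(log_text):
    """Return one dict (exitcode, local_rank, pid) per failed worker in the log."""
    return [
        {key: int(value) for key, value in match.groupdict().items()}
        for match in FAILED_RE.finditer(log_text)
    ]


# Hypothetical usage with the summary line copied from the report above.
sample = (
    "[2024-05-28 14:04:28,584] torch.distributed.elastic.multiprocessing.api: "
    "[ERROR] failed (exitcode: 1) local_rank: 2 (pid: 60554) "
    "of binary: /opt/conda/bin/python"
)
print(find_failed_workers(sample))
```

With the rank known, searching the earlier output for the first Python traceback printed by that worker, together with the exact launch command and the PyTorch and CUDA versions, would likely be enough to reproduce the failure.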