WangYueFt / detr3d

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 71517) #19

Light-- opened this issue 2 years ago

Light-- commented 2 years ago

When I run the command

tools/dist_test.sh projects/configs/detr3d/detr3d_vovnet_gridmask_det_final_trainval_cbgs.py /path/to/ckpt 1 --eval=bbox

I get the following error at the end:

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 71517) of binary: /home/xxx/bin/python
Traceback (most recent call last):
  File "/home/xxx/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/xxx/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/xxx/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/xxx/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/xxx/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/xxx/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/xxx/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/xxx/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
tools/test.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-02-22_20:41:39
  host      : k8s-deploy-xxx-xxx-xxx-xxx
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 71517)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 71517
============================================================

I don't know why... What happened?

Light-- commented 2 years ago

Answering my own question: try again on a machine with more RAM, e.g. one with 40 GB.
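
For reference: exit code -9 means the worker received SIGKILL (as the "Signal 9 (SIGKILL)" line in the log above shows), which on Linux usually comes from the kernel OOM killer when host RAM runs out, not from CUDA itself. A minimal sketch for checking the kernel log for an OOM kill, assuming a Linux host (reading dmesg may need root on some systems):

import subprocess

# Scan the kernel log for OOM-killer messages about killed processes.
log = subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout
for line in log.splitlines():
    if "out of memory" in line.lower() or "killed process" in line.lower():
        print(line)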

a1600012888 commented 2 years ago

Hi, you need around 20 GB to fit ResNet-101 with a batch size of 1. For VoVNet, you need more. See this similar issue: https://github.com/WangYueFt/detr3d/issues/21
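
A quick way to see how close a run gets to those figures is to print PyTorch's peak GPU memory counters around a forward pass. A minimal sketch (report_gpu_memory is an illustrative helper, not part of this repo):

import torch

def report_gpu_memory(device=0):
    # Peak values since the last reset_peak_memory_stats() call, converted to GiB.
    peak_alloc = torch.cuda.max_memory_allocated(device) / 1024 ** 3
    peak_reserved = torch.cuda.max_memory_reserved(device) / 1024 ** 3
    print(f"peak allocated: {peak_alloc:.1f} GiB, peak reserved: {peak_reserved:.1f} GiB")

# Usage: call torch.cuda.reset_peak_memory_stats(device) before the forward/backward
# pass, then report_gpu_memory(device) afterwards.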

Suodislie commented 2 years ago

Hi, have you solved it? I have the same problem.

a1600012888 commented 2 years ago

> Hi, have you solved it? I have the same problem.

He seems to have had a memory issue, and he fixed it by switching to a GPU with more memory.

You need around 20 GB to train ResNet-101 with FPN with a batch size of 1 on a single GPU. To train VoVNet, you need more.

Do you see any 'CUDA out of memory' in your error log?

Sicily-love commented 2 years ago

> Hi, have you solved it? I have the same problem.

> He seems to have had a memory issue, and he fixed it by switching to a GPU with more memory.

> You need around 20 GB to train ResNet-101 with FPN with a batch size of 1 on a single GPU. To train VoVNet, you need more.

> Do you see any 'CUDA out of memory' in your error log?

I do see 'CUDA out of memory' in my error log. What can I do to solve this problem? Thank you.

hnumrx commented 1 year ago

> Hi, have you solved it? I have the same problem.

> He seems to have had a memory issue, and he fixed it by switching to a GPU with more memory. You need around 20 GB to train ResNet-101 with FPN with a batch size of 1 on a single GPU. To train VoVNet, you need more. Do you see any 'CUDA out of memory' in your error log?

> I do see 'CUDA out of memory' in my error log. What can I do to solve this problem? Thank you.

Maybe you can reduce the batch size.
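
In mmdetection3d-style configs such as the detr3d ones, the per-GPU batch size is set by samples_per_gpu in the data dict. A minimal sketch of the fields to lower, assuming the config follows that standard layout (only the changed fields are shown, the values are illustrative, and the train/val/test entries stay as in the original file):

# In projects/configs/detr3d/detr3d_vovnet_gridmask_det_final_trainval_cbgs.py
data = dict(
    samples_per_gpu=1,   # batch size per GPU; 1 is the minimum
    workers_per_gpu=2,   # fewer dataloader workers also lowers host RAM usage
    # ... train/val/test dataset settings unchanged ...
)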

Wangqk-wqk commented 1 year ago

It's not useful.

> Hi, have you solved it? I have the same problem.

> He seems to have had a memory issue, and he fixed it by switching to a GPU with more memory. You need around 20 GB to train ResNet-101 with FPN with a batch size of 1 on a single GPU. To train VoVNet, you need more. Do you see any 'CUDA out of memory' in your error log?

> I do see 'CUDA out of memory' in my error log. What can I do to solve this problem? Thank you.

> Maybe you can reduce the batch size.

I tried this method, but it didn't help.