ZhehengJiangLancaster / AMVUR


torch.distributed.elastic.multiprocessing.errors.ChildFailedError: #2

Open Chinatown123 opened 11 months ago

Chinatown123 commented 11 months ago

torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

experiments/supervised_HO3D_v2.py FAILED

Failures:

Root Cause (first observed failure):
[0]:
  time       : 2023-09-14 01:45:50
  host       : ding-X10DRG
  rank       : 2 (local rank: 2)
  exitcode   : -9 (pid: 36071)
  error file :
  traceback  : Signal 9 (SIGKILL) received by PID 36071

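An exit code of -9 means the process received SIGKILL from outside PyTorch; with several worker processes this is most commonly the kernel's OOM killer terminating a rank when host RAM runs out, rather than a CUDA out-of-memory error. Below is a minimal sketch for confirming such growth by logging host memory per iteration; it assumes the psutil package is available, and the loop shown in comments is illustrative, not this repository's actual training code.

```python
# Sketch: log host RSS per training iteration to confirm steady memory growth.
# Assumes `psutil` is installed; the loop in the comments is illustrative only.
import os
import psutil

process = psutil.Process(os.getpid())

def log_host_memory(step):
    # Resident set size of this process, in MiB.
    rss_mib = process.memory_info().rss / (1024 ** 2)
    print(f"step {step}: host RSS = {rss_mib:.1f} MiB")

# Inside the training loop:
# for step, batch in enumerate(dataloader):
#     ...
#     if step % 100 == 0:
#         log_host_memory(step)
```
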
Chinatown123 commented 11 months ago

I tried to run:

python -m torch.distributed.launch --nproc_per_node=4 \
    experiments/supervised_HO3D_v2.py \
    --config_json ./experiments/config/train.json

During the fifth epoch, this error occurred.

Chinatown123 commented 11 months ago

More specifically, when I run:

python -m torch.distributed.launch --nproc_per_node=4 \
    experiments/supervised_HO3D_v2.py \
    --config_json ./experiments/config/train.json

the total memory in use keeps growing, and eventually the error occurs. It is still not solved.
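In a PyTorch training loop, memory that grows steadily across iterations is very often caused by keeping references to tensors that still carry the autograd graph, for example accumulating the raw loss tensor or appending per-batch outputs to a Python list without detaching them. The sketch below shows the pattern; `model`, `dataloader`, `criterion`, and `optimizer` are placeholders, not identifiers from this repository.

```python
import torch

# Illustrative training loop; model/dataloader/criterion/optimizer are placeholders,
# not code from the AMVUR repository.
def train_one_epoch(model, dataloader, criterion, optimizer, device):
    running_loss = 0.0
    for images, targets in dataloader:
        images, targets = images.to(device), targets.to(device)

        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()

        # BAD: `running_loss += loss` keeps the whole autograd graph alive,
        # so memory grows every iteration.
        # GOOD: take a Python float so the graph can be freed.
        running_loss += loss.item()

        # If per-batch outputs must be stored, detach and move them off the GPU:
        # predictions.append(output.detach().cpu())
    return running_loss / max(len(dataloader), 1)
```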

JoyboyWang commented 10 months ago

Hi, I encountered the same problem. Is there any solution?

Chinatown123 commented 10 months ago

> Hi, I encountered the same problem. Is there any solution?

Not yet. Have you managed to solve it?

JoyboyWang commented 10 months ago

> Hi, I encountered the same problem. Is there any solution?
>
> Not yet. Have you managed to solve it?

No... I tried "torch.cuda.empty_cache()" and "gc.collect()" in each iteration, but they don't seem to help.
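For what it's worth, torch.cuda.empty_cache() only returns cached blocks from PyTorch's CUDA allocator to the driver, and gc.collect() only frees objects that are no longer referenced, so neither helps if the training loop itself is still holding references, and neither addresses host-RAM growth. One way to see which Python-level allocations are growing, using only the standard library (a diagnostic sketch, not something from this repository), is tracemalloc:

```python
import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

# ... run a few hundred training iterations here ...

current = tracemalloc.take_snapshot()
# Show the source lines whose allocations grew the most since the baseline.
for stat in current.compare_to(baseline, "lineno")[:10]:
    print(stat)
```

Note that tracemalloc only sees Python-level allocations; tensor storage allocated by PyTorch's C++ allocator will not show up here, but leaked Python-side containers (lists of outputs, cached batches) will.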

Chinatown123 commented 10 months ago

> Hi, I encountered the same problem. Is there any solution?
>
> Not yet. Have you managed to solve it?
>
> No... I tried "torch.cuda.empty_cache()" and "gc.collect()" in each iteration, but they don't seem to help.

I have also tried them and they didn't work. Have you solved it by now? I came back to this project today but still have no idea how to fix it.
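Since the failure is a SIGKILL rather than a CUDA out-of-memory error, host-RAM pressure from data loading is also worth checking: with --nproc_per_node=4, each of the four processes creates its own DataLoader workers, and persistent workers keep their memory alive between epochs. A hedged sketch of loader settings that reduce resident memory follows; the dataset and every parameter value are placeholders, since the actual loader configuration in this repository's training script is not shown in the thread.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for the real HO3D loader; shapes and sizes are arbitrary.
dataset = TensorDataset(torch.randn(128, 3, 224, 224),
                        torch.zeros(128, dtype=torch.long))

loader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    num_workers=2,             # each launched process spawns its own set of workers
    persistent_workers=False,  # let worker processes (and their RAM) be released each epoch
    pin_memory=False,          # pinned host buffers add to resident memory
)
```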