Open Chinatown123 opened 11 months ago
I tried to run:
python -m torch.distributed.launch --nproc_per_node=4 experiments/supervised_HO3D_v2.py --config_json ./experiments/config/train.json
During the fifth epoch, this error occurred.
More specifically, when I run the command above, the total memory in use keeps growing until the error finally occurs. It is still not solved.
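To narrow down growth like this, it can help to record memory after every epoch and see which source lines account for the increase. Below is a minimal sketch using Python's standard `tracemalloc` module. The names `run_epochs`, `top_growth` and `leaky_step` are hypothetical stand-ins and are not part of this repository. Note the assumption: `tracemalloc` only sees memory allocated through Python's allocators, so it will not show CUDA memory or buffers allocated by native extensions. For GPU memory, `torch.cuda.memory_allocated()` can be logged per iteration in the same way.

```python
import tracemalloc


def top_growth(before, after, limit=3):
    """Return the source lines whose traced allocations grew the most."""
    stats = after.compare_to(before, "lineno")
    return [s for s in stats if s.size_diff > 0][:limit]


def run_epochs(step, num_epochs):
    """Call step(epoch) once per epoch and record traced memory after each one."""
    tracemalloc.start()
    baseline = tracemalloc.take_snapshot()
    usage = []
    for epoch in range(num_epochs):
        step(epoch)
        current, _peak = tracemalloc.get_traced_memory()
        usage.append(current)
    final = tracemalloc.take_snapshot()
    tracemalloc.stop()
    return usage, top_growth(baseline, final)


# Hypothetical stand-in for a training epoch that leaks: it keeps every result.
history = []


def leaky_step(epoch):
    history.append([float(epoch)] * 10_000)


usage, growth = run_epochs(leaky_step, num_epochs=5)
for epoch, size in enumerate(usage):
    print(f"epoch {epoch}: {size / 1024:.0f} KiB traced")
for stat in growth:
    print(stat)  # file, line number, and how much that line's allocations grew
```

If the per-epoch numbers climb steadily, the lines reported by `top_growth` are a reasonable first place to look for objects that are kept alive across epochs.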
Hi, I encountered the same problem. Is there any solution?
Not yet. Have you solved the problem by now?
No... I tried calling torch.cuda.empty_cache() and gc.collect() in each iteration, but they don't seem to help.
I have also tried them, and they didn't work. Have you solved it yet? I came back to this project today but still have no idea how to solve it.
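One general reason why torch.cuda.empty_cache() and gc.collect() may make no difference: gc.collect() only reclaims objects that are no longer reachable, and torch.cuda.empty_cache() only releases cached blocks that no tensor is still using. If the training loop keeps a reference to something from every iteration, neither call can free it. Nothing in this thread confirms that this is the cause here, so the following is only a sketch of the pattern in plain Python. `Result`, `keep_objects` and `keep_numbers` are hypothetical names used for illustration.

```python
import gc
import weakref


class Result:
    """Hypothetical stand-in for a per-iteration object such as a loss tensor."""

    def __init__(self, value):
        self.value = value
        self.payload = bytearray(1024 * 1024)  # 1 MiB, so retention is noticeable


def keep_objects(num_iters):
    """Leaky pattern: the log holds every Result, so none can be freed."""
    log, refs = [], []
    for i in range(num_iters):
        r = Result(float(i))
        refs.append(weakref.ref(r))
        log.append(r)
        del r
        gc.collect()  # cannot free the object: `log` still references it
    alive = sum(ref() is not None for ref in refs)
    return log, alive


def keep_numbers(num_iters):
    """Fixed pattern: only the plain float is stored, so each Result is freed."""
    log, refs = [], []
    for i in range(num_iters):
        r = Result(float(i))
        refs.append(weakref.ref(r))
        log.append(r.value)
        del r
    gc.collect()
    alive = sum(ref() is not None for ref in refs)
    return log, alive


_, leaked = keep_objects(5)
_, freed = keep_numbers(5)
print(f"objects still alive when stored whole: {leaked}")
print(f"objects still alive when only values are stored: {freed}")
```

In PyTorch code the analogous pattern is accumulating a loss tensor across iterations (for example `total_loss += loss`), which keeps each iteration's autograd graph alive. Accumulating `loss.item()` or a detached value instead avoids that. Whether the training script in this issue does something like this would need to be checked in its loop.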
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
experiments/supervised_HO3D_v2.py FAILED
Failures: