HRNet / DEKR

This is an official implementation of our CVPR 2021 paper "Bottom-Up Human Pose Estimation Via Disentangled Keypoint Regression" (https://arxiv.org/abs/2104.02300)
MIT License

installation issue: ncclSystemError: System call (socket, malloc, munmap, etc) failed. #18

Closed looninho closed 2 years ago

looninho commented 3 years ago

Hi,

thank you for sharing your work.

I'm trying to test DEKR but am running into an NCCL issue. When I run train.py, it returns this error:

RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8

ncclSystemError: System call (socket, malloc, munmap, etc) failed.

Could you give me some tips to overcome this?

Environment: CUDA: GPU:

System:

longpeace commented 2 years ago

I ran into the same problem. Have you found a way to solve it? I'd appreciate it if you could let me know.

looninho commented 2 years ago

Hi @longpeace,

I solved the ncclSystemError issue by adding the --ipc=host flag to the docker run command.
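
For anyone hitting the same error, one plausible reason the flag helps: by default Docker gives each container a private /dev/shm of only 64 MB, and --ipc=host makes the container share the host's IPC namespace (including /dev/shm) instead. The sketch below is only a quick diagnostic to run inside the container before launching train.py. The function name shm_report and the 1 GiB threshold are my own assumptions for illustration, not something from DEKR or NCCL.

```python
import shutil


def shm_report(path="/dev/shm", min_bytes=1 << 30):
    """Report the size of the shared-memory mount at `path`.

    `min_bytes` (1 GiB here) is an arbitrary, assumed threshold used only
    to flag a mount that looks like Docker's small default.
    """
    usage = shutil.disk_usage(path)
    return {
        "path": path,
        "total_bytes": usage.total,
        "free_bytes": usage.free,
        "looks_small": usage.total < min_bytes,
    }


if __name__ == "__main__":
    report = shm_report()
    print(f"{report['path']}: total {report['total_bytes'] / 2**20:.0f} MiB, "
          f"free {report['free_bytes'] / 2**20:.0f} MiB")
    if report["looks_small"]:
        # Hypothetical hint; adapt it to your own docker command.
        print("Shared memory looks small; consider adding --ipc=host "
              "(or --shm-size) to the docker run command.")
```

If the reported total is around 64 MiB, the container is probably using Docker's default shared-memory segment, and restarting it with --ipc=host (or a larger --shm-size) is worth trying.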

looninho commented 2 years ago

[SOLVED]