PyTorch Version: `torch_shm_manager` error when running with multiprocessing

alex-razor commented 5 years ago

Running the code doesn't work. I get the following error:

(venv) juggernaut@xmen9:/hdd/AlphaPose$ python demo.py --indir examples/demo/
Loading YOLO model..
torch_shm_manager: error while loading shared libraries: libcudart.so.10.0: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/hdd/kps_pipeline/venv/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 314, in reduce_storage
    metadata = storage._share_filename_()
RuntimeError: error executing torch_shm_manager at "/hdd/kps_pipeline/venv/lib/python3.6/site-packages/torch/bin/torch_shm_manager" at /pytorch/torch/lib/libshm/core.cpp:99
torch_shm_manager: error while loading shared libraries: libcudart.so.10.0: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/hdd/kps_pipeline/venv/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 314, in reduce_storage
    metadata = storage._share_filename_()
RuntimeError: error executing torch_shm_manager at "/hdd/kps_pipeline/venv/lib/python3.6/site-packages/torch/bin/torch_shm_manager" at /pytorch/torch/lib/libshm/core.cpp:99
torch_shm_manager: error while loading shared libraries: libcudart.so.10.0: cannot open shared object file: No such file or directory
torch_shm_manager: error while loading shared libraries: libcudart.so.10.0: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "demo.py", line 50, in <module>
    det_loader = DetectionLoader(data_loader, batchSize=args.detbatch).start()
  File "/hdd/AlphaPose/dataloader.py", line 309, in start
    p.start()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/lib/python3.6/multiprocessing/context.py", line 291, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_forkserver.py", line 35, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_forkserver.py", line 47, in _launch
    reduction.dump(process_obj, buf)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "/hdd/kps_pipeline/venv/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 314, in reduce_storage
    metadata = storage._share_filename_()
RuntimeError: error executing torch_shm_manager at "/hdd/kps_pipeline/venv/lib/python3.6/site-packages/torch/bin/torch_shm_manager" at /pytorch/torch/lib/libshm/core.cpp:99

However, when I add the --sp flag, it works fine.

Python 3.6
CUDA 9.0
cuDNN 7
torch 1.2.0
torchfile 0.1.0
torchvision 0.4.0
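
The first line of the log says the dynamic loader cannot find `libcudart.so.10.0`, while the environment above lists CUDA 9.0. A minimal sketch for checking whether a given shared library can be opened from the current environment; the helper name `can_load` is hypothetical, and the library name is copied from the error message:

```python
import ctypes


def can_load(lib_name: str) -> bool:
    """Return True if the dynamic loader can open lib_name, else False."""
    try:
        ctypes.CDLL(lib_name)
    except OSError:
        # Raised when the library (or one of its dependencies) is not found.
        return False
    return True


if __name__ == "__main__":
    # Library name taken from the error message in the log above.
    name = "libcudart.so.10.0"
    print(name, "loadable:", can_load(name))
```

If this prints `False`, the installed PyTorch binary was probably built against a CUDA runtime that is not on the library search path.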

Fang-Haoshu commented 5 years ago

Hi, can you try modifying line 26 of 'demo.py' as below? torch.multiprocessing.set_start_method('spawn', force=True)
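
A minimal sketch of where such a call usually goes. This is not AlphaPose's actual `demo.py`: the `worker` and `use_spawn` functions are hypothetical, and the standard-library `multiprocessing` module is used as a fallback so the sketch runs without PyTorch installed (`torch.multiprocessing` re-exports the same `set_start_method` API):

```python
try:
    import torch.multiprocessing as mp  # wrapper around the stdlib module
except ImportError:
    import multiprocessing as mp  # fallback so the sketch runs without torch


def worker(name):
    """Hypothetical child task: just report that it ran."""
    print("hello from", name)


def use_spawn():
    """Force the 'spawn' start method and return the method now in effect."""
    mp.set_start_method("spawn", force=True)
    return mp.get_start_method()


if __name__ == "__main__":
    # Set the start method once, before any Process is created.
    print("start method:", use_spawn())
    proc = mp.Process(target=worker, args=("child",))
    proc.start()
    proc.join()
    print("child exit code:", proc.exitcode)
```

The `if __name__ == "__main__":` guard matters with `spawn`, because each child re-imports the main module.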

alex-razor commented 5 years ago

Thank you for your reply. However, it didn't help; I get the same error.

Fang-Haoshu commented 5 years ago

Oh, that's weird. We have only tested with PyTorch 1.1 so far. Can you check whether PyTorch 1.1 works for you?

alex-razor commented 5 years ago

That did work for me. Thanks!

David-on-Code commented 5 years ago

RuntimeError: error executing torch_shm_manager at "/hdd/kps_pipeline/venv/lib/python3.6/site-packages/torch/bin/torch_shm_manager" at /pytorch/torch/lib/libshm/core.cpp:99

How can I solve it?

schmmd commented 5 years ago

I'm also hitting this, but on torch==1.3.0

maochen commented 5 years ago

Same on torch==1.3.0. OS: macOS 10.14.6

waiting-gy commented 4 years ago

> RuntimeError: error executing torch_shm_manager at "/hdd/kps_pipeline/venv/lib/python3.6/site-packages/torch/bin/torch_shm_manager" at /pytorch/torch/lib/libshm/core.cpp:99
>
> How can I solve it?

Do you know how to solve it? Thank you!

Abhipray commented 4 years ago

I was seeing this error with 1.3.0. Upgrading to 1.3.1 fixed it for me.

asheeshcric commented 4 years ago

@Abhipray I have torch==1.3.1 installed, but it isn't working for me. I get the same error. Has anyone found the solution to this problem?

Ehsan-Yaghoubi commented 4 years ago

I had the same problem. With the following versions, AlphaPose worked and generated a JSON file for the images.

I created a virtual environment with Python 3.6. If you don't know how to do that, have a look at https://gist.github.com/frfahim/73c0fad6350332cef7a653bcd762f08d
I installed the latest version of PyTorch using https://pytorch.org/ and selected CUDA 9.2 (CUDA 10.0 did not work). The command I used was: pip3 install torch==1.3.1+cu92 torchvision==0.4.2+cu92 -f https://download.pytorch.org/whl/torch_stable.html
I installed CUDA 9.2 from https://developer.nvidia.com/cuda-92-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1604&target_type=runfilelocal

Then follow the AlphaPose instructions: download the models and run the following. Before running the pip3 command, remove torch, torchvision, and ntpath from requirements.txt.

git clone -b pytorch https://github.com/MVIG-SJTU/AlphaPose.git
pip3 install -r requirements.txt
python3 demo.py --indir examples/demo --outdir examples/res

SUMMARY:

Ubuntu 16.04
Python 3.6
CUDA 9.2
cuDNN 7
torch==1.3.1+cu92
torchvision==0.4.2+cu92
GPU: NVIDIA 2080 Ti
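
The combination above pairs a PyTorch wheel built for CUDA 9.2 with a CUDA 9.2 installation. A short sketch for printing which CUDA version the installed wheel was built against, so it can be compared with the system toolkit; the helper name `torch_build_info` is hypothetical, and it returns `None` when PyTorch is not importable:

```python
def torch_build_info():
    """Return version details of the installed PyTorch, or None if absent."""
    try:
        import torch
    except ImportError:
        return None
    return {
        "torch": torch.__version__,
        # CUDA version the wheel was compiled against (None for CPU-only builds).
        "built_for_cuda": torch.version.cuda,
        # Whether PyTorch can actually use a GPU in this environment.
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(torch_build_info())
```

Comparing `built_for_cuda` with the output of `nvcc --version` is one way to spot the kind of mismatch discussed in this thread.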

phamdat09 commented 4 years ago

Hello @Ehsan-Yaghoubi, how many FPS did you get? Thanks

Ehsan-Yaghoubi commented 4 years ago

Hi, I only used it to produce the pose information for my own dataset. I didn't check the metrics as I didn't need them.

phamdat09 commented 4 years ago

Hi @Ehsan-Yaghoubi, thanks for your reply!

cslxiao commented 4 years ago

It still happens with PyTorch 1.4

cdyangbo commented 4 years ago

Set num_workers=0

cdyangbo commented 4 years ago

torch.multiprocessing.set_start_method('spawn', force=True) works well with num_workers > 0 on macOS.

nlml commented 4 years ago

I was just able to fix this by commenting out a line I had added to fix an issue on a different system:

Old: torch.multiprocessing.set_sharing_strategy('file_system')

New: # torch.multiprocessing.set_sharing_strategy('file_system')

I think the problem in my case might be caused by my system having CUDA 10.2 while PyTorch is installed as the CUDA 10.1 build. But commenting out the above line at the start of my script fixed the problem, at least in my case.
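
According to the PyTorch multiprocessing notes, the `file_system` sharing strategy relies on the `torch_shm_manager` helper process, which is the binary that fails to start in the log at the top of this thread; that would be consistent with the fix above. A small sketch for inspecting which strategy is active; `sharing_info` is a hypothetical helper and returns `None` when PyTorch is not installed:

```python
def sharing_info():
    """Return the current and available tensor sharing strategies, or None."""
    try:
        import torch.multiprocessing as mp
    except ImportError:
        return None
    return {
        # Strategy used when tensors are passed between processes.
        "current": mp.get_sharing_strategy(),
        # Strategies supported on this platform.
        "available": sorted(mp.get_all_sharing_strategies()),
    }


if __name__ == "__main__":
    print(sharing_info())
```

On Linux the default is usually `file_descriptor`, so removing an explicit `set_sharing_strategy('file_system')` call falls back to a strategy that does not need the helper binary.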

Amir22010 commented 4 years ago

@nlml That works for me, thanks! I have PyTorch 1.4 with CUDA 10.2.

qhdqhd commented 3 years ago

Adding --sp works fine for me.

Zrrr1997 commented 3 years ago

Hitting the same error:

(alphapose) zrrr@zrrr-GL552VW:~/Projects/AlphaPose$ python scripts/demo_inference.py --cfg configs/coco/resnet/256x192_res50_lr1e-3_1x.yaml --checkpoint pretrained_models/fast_res50_256x192.pth --indir examples/demo/

Traceback (most recent call last):
  File "scripts/demo_inference.py", line 175, in <module>
    det_loader = DetectionLoader(input_source, get_detector(args), cfg, args, batchSize=args.detbatch, mode=mode, queueSize=args.qsize)
  File "/home/zrrr/Projects/AlphaPose/detector/apis.py", line 12, in get_detector
    from detector.yolo_api import YOLODetector
  File "/home/zrrr/Projects/AlphaPose/detector/yolo_api.py", line 27, in <module>
    from detector.nms import nms_wrapper
  File "/home/zrrr/Projects/AlphaPose/detector/nms/__init__.py", line 1, in <module>
    from .nms_wrapper import nms, soft_nms
  File "/home/zrrr/Projects/AlphaPose/detector/nms/nms_wrapper.py", line 4, in <module>
    from . import nms_cpu, nms_cuda
ImportError: libcudart.so.10.0: cannot open shared object file: No such file or directory

Python 3.6.13
CUDA Toolkit 9.0
cuDNN 7.6.5
torch 1.1.0
torchvision 0.3.0

How can I fix this?

maochen commented 3 years ago

Could you try any version of torch >= 1.3.1 to see if the issue is still there?

qhdqhd commented 3 years ago

Adding --sp works.

angerhang commented 2 years ago

> I was just able to fix this by commenting out a line I had added to fix an issue on a different system:
>
> Old: torch.multiprocessing.set_sharing_strategy('file_system')
>
> New: # torch.multiprocessing.set_sharing_strategy('file_system')
>
> I think the problem in my case might be caused by my system having CUDA 10.2 while PyTorch is installed as the CUDA 10.1 build. But commenting out the above line at the start of my script fixed the problem, at least in my case.

I had to do the same to make the code work on Linux. Any idea why this happens?

tianhangpan commented 9 months ago

> Hi, can you try modifying line 26 of 'demo.py' as below? torch.multiprocessing.set_start_method('spawn', force=True)

Thanks, that works for me on Linux!

MVIG-SJTU / AlphaPose #402