MVIG-SJTU / AlphaPose

Real-Time and Accurate Full-Body Multi-Person Pose Estimation & Tracking System
http://mvig.org/research/alphapose.html

No space left on device #1163

Open WhaSukGO opened 11 months ago

WhaSukGO commented 11 months ago

How to reproduce

Running on nvcr.io/nvidia/cuda:11.7.1-devel-ubuntu20.04 container

python scripts/demo_inference.py \
    --cfg downloaded/multi_domain_model/256x192_res50_lr1e-3_2x-regression.yaml \
    --checkpoint downloaded/multi_domain_model/multi_domain_fast50_regression_256x192.pth \
    --video downloaded/serve_test_13.mp4 \
    --detbatch 1 \
    --posebatch 1 \
    --save_video \
    --pose_track

Error

...

loading reid model from trackers/weights/osnet_ain_x1_0_msmt17_256x128_amsgrad_ep50_lr0.0015_coslr_b64_fb10_softmax_labsmth_flip_jitter.pth...
Traceback (most recent call last):
  File "scripts/demo_inference.py", line 208, in <module>
    writer = DataWriter(cfg, args, save_video=True, video_save_opt=video_save_opt, queueSize=queueSize).start()
  File "/Development/AlphaPose/alphapose/utils/writer.py", line 40, in __init__
    self.result_queue = mp.Queue(maxsize=queueSize)
  File "/usr/lib/python3.7/multiprocessing/context.py", line 102, in Queue
    return Queue(maxsize, ctx=self.get_context())
  File "/usr/lib/python3.7/multiprocessing/queues.py", line 42, in __init__
    self._rlock = ctx.Lock()
  File "/usr/lib/python3.7/multiprocessing/context.py", line 67, in Lock
    return Lock(ctx=self.get_context())
  File "/usr/lib/python3.7/multiprocessing/synchronize.py", line 162, in __init__
    SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
  File "/usr/lib/python3.7/multiprocessing/synchronize.py", line 59, in __init__
    unlink_now)
OSError: [Errno 28] No space left on device
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/Development/AlphaPose/ENV_37/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 355, in reduce_storage
    metadata = storage._share_filename_cpu_()
RuntimeError: unable to write to file </torch_303096_264530030_0>: No space left on device (28)

Debug

Checking the storage via the df -h command:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay         1.8T   26G  1.7T   2% /
tmpfs            64M     0   64M   0% /dev
tmpfs            16G     0   16G   0% /sys/fs/cgroup
shm              64M   57M  7.9M  88% /dev/shm
/dev/sda1       1.8T   26G  1.7T   2% /data
/dev/nvme0n1p3  733G   76G  620G  11% /tmp/.X11-unix
tmpfs            16G   12K   16G   1% /proc/driver/nvidia
tmpfs           3.2G  3.4M  3.2G   1% /run/nvidia-persistenced/socket
udev             16G     0   16G   0% /dev/nvidia0
tmpfs            16G     0   16G   0% /proc/asound
tmpfs            16G     0   16G   0% /proc/acpi
tmpfs            16G     0   16G   0% /proc/scsi
tmpfs            16G     0   16G   0% /sys/firmware

The root filesystem has plenty of space, but /dev/shm is only 64M (Docker's default) and already 88% used, which is where the "No space left on device" comes from. I'm currently attempting to enlarge shm.
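As far as I understand, PyTorch's multiprocessing moves CPU tensors into shared memory backed by /dev/shm, and AlphaPose's DataWriter pushes every frame's results through such a queue. A standalone sketch (not AlphaPose code) showing how quickly a 64M /dev/shm fills up:

# sketch: each shared tensor is backed by /dev/shm, so a few of these
# already exhaust a container's default 64M shared-memory mount
import torch

held = []
for i in range(8):
    t = torch.zeros(8 * 1024 * 1024)   # 8M float32 values, roughly 32 MB
    t.share_memory_()                  # move the storage into /dev/shm
    held.append(t)
    print(f"tensor {i} moved to shared memory")

Inside a container with the default shm size, this loop should fail after the first tensor or two with the same "No space left on device" error as above.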

WhaSukGO commented 11 months ago

I created a new container with a larger shm, and it worked!

$ docker run --gpus all --env="DISPLAY" -it --shm-size=16G -v /tmp/.X11-unix:/tmp/.X11-unix -v /hdd/data:/data --name=ubuntu2004_cu117_0 nvcr.io/nvidia/cuda:11.7.1-devel-ubuntu20.04
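To confirm the new container really has enough shared memory before launching the demo, a quick check like this works (the 1 GiB threshold is just my own guess, not an AlphaPose requirement):

# sketch: verify /dev/shm size inside the container before running demo_inference.py
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total={total / 2**20:.0f}M, free={free / 2**20:.0f}M")
if free < 1 << 30:  # 1 GiB, arbitrary threshold
    raise SystemExit("Shared memory still too small; restart with a larger --shm-size.")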