A problem in running run.sh

zkailong commented 6 years ago

Environment: Ubuntu 16.04; CUDA 9.0.176; cuDNN 7.0.5; TensorFlow 1.6.0 (GPU). Following #10 and #3, I have installed torch, luarocks, hdf5, etc., but there are still problems when running it:

name@name-All-Series:~/AlphaPose$ ./run.sh --indir examples/demo/ --outdir examples/results/ --vis 0
generating bbox from Faster RCNN...
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
  from ._conv import register_converters as _register_converters
2018-03-06 17:32:48.947257: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-03-06 17:32:49.022229: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-06 17:32:49.022485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties:
name: GeForce GTX 750 major: 5 minor: 2 memoryClockRate(GHz): 1.188
pciBusID: 0000:01:00.0
totalMemory: 1.95GiB freeMemory: 1.64GiB
2018-03-06 17:32:49.022502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-03-06 17:32:49.244052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1403 MB memory) -> physical GPU (device: 0, name: GeForce GTX 750, pci bus id: 0000:01:00.0, compute capability: 5.2)
Loaded network ../output/res152/coco_2014_train+coco_2014_valminusminival/default/res152.ckpt
/home/name/AlphaPose/examples/demo/
0%| | 0/3 [00:00<?, ?it/s]2018-03-06 17:32:55.388581: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-03-06 17:32:55.500633: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-03-06 17:32:56.432382: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 922.50MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
67%|██████████████████████████████ | 2/3 [00:07<00:03, 3.52s/it]2018-03-06 17:33:01.331154: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-03-06 17:33:01.411505: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-03-06 17:33:01.501123: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.07GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-03-06 17:33:02.435389: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-03-06 17:33:02.543607: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
100%|█████████████████████████████████████████████| 3/3 [00:10<00:00, 3.45s/it]
pose estimation with RMPE...
/home/name/torch/install/bin/lua: /home/name/torch/install/share/lua/5.2/trepl/init.lua:389: /home/name/torch/install/share/lua/5.2/hdf5/ffi.lua:56: expected align(#) on line 579
stack traceback:
    [C]: in function 'error'
    /home/name/torch/install/share/lua/5.2/trepl/init.lua:389: in function 'require'
    /home/name/AlphaPose/predict/util.lua:7: in main chunk
    [C]: in function 'dofile'
    /home/name/torch/install/share/lua/5.2/paths/init.lua:84: in function 'dofile'
    main-alpha-pose.lua:7: in main chunk
    [C]: in function 'dofile'
    ...oyer/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    [C]: in ?
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
  from ._conv import register_converters as _register_converters
Traceback (most recent call last):
  File "parametric-pose-nms-MPII.py", line 256, in <module>
    get_result_json(args)
  File "parametric-pose-nms-MPII.py", line 243, in get_result_json
    test_parametric_pose_NMS_json(delta1, delta2, mu, gamma,args.outputpath)
  File "parametric-pose-nms-MPII.py", line 99, in test_parametric_pose_NMS_json
    h5file = h5py.File(os.path.join(outputpath,"POSE/test-pose.h5"), 'r')
  File "/usr/local/lib/python2.7/dist-packages/h5py/_hl/files.py", line 269, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
  File "/usr/local/lib/python2.7/dist-packages/h5py/_hl/files.py", line 99, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 78, in h5py.h5f.open
IOError: Unable to open file (unable to open file: name = '/home/name/AlphaPose/examples/results/POSE/test-pose.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
visualization...
Traceback (most recent call last):
  File "json-video.py", line 63, in <module>
    with open(jsonpath) as f:
IOError: [Errno 2] No such file or directory: '/home/name/AlphaPose/examples/results/POSE/alpha-pose-results-forvis.json'

So, how can I solve it?

sberryman commented 6 years ago

So I spent the better part of the day yesterday trying to get AlphaPose to compile and run inference. I finally figured out a combination that works.

Dockerfile

https://gist.github.com/sberryman/82a6d13a44f9c4a3bfaf9263b36c92ed

Important versions:

cuDNN version 5
TensorFlow >= 1.2 AND < 1.3 (if you build TensorFlow from source, the cuDNN version isn't as important; when installing from pip it becomes VERY important)
Input and output directories for ./run.sh must be relative to the CWD. Absolute paths do not work!

Even if you don't use Docker you can get a very good idea of the steps I had to take to get AlphaPose running. Also, a lot of those ubuntu dependencies that are installed on line 8 can be removed. Those are left over from another project and I haven't had time to clean them up.
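The version and path constraints listed above can be checked before launching ./run.sh. The following is a minimal sketch, not part of AlphaPose: the function names `tensorflow_version_ok` and `is_relative_dir` are hypothetical, and the version string is passed in by hand (for example the value of `tensorflow.__version__`) so the check itself does not need TensorFlow installed.

```python
import os
import re


def tensorflow_version_ok(version):
    """Return True if `version` is >= 1.2 and < 1.3, the range reported to work."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuple comparison handles the major and minor parts together.
    return (1, 2) <= (major, minor) < (1, 3)


def is_relative_dir(path):
    """Return True if `path` is relative, as run.sh is reported to require."""
    return not os.path.isabs(path)


# Values taken from the first report in this thread.
print(tensorflow_version_ok("1.6.0"))
print(is_relative_dir("examples/demo/"))
```

With the environment from the first post, the version check fails (TensorFlow 1.6.0 is outside the range) while the directory arguments pass, since examples/demo/ and examples/results/ are already relative.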

sberryman commented 6 years ago

Your error looks more like it has to do with running out of GPU memory though. Your card (GPU) only has totalMemory: 1.95GiB freeMemory: 1.64GiB

I see RCNN using about 4.8GB of memory, and Torch was using about 1.8GB with a batch size of 1. That is my experience running on a GTX 1080. I haven't tried my 1080 Tis yet.

Update: human-detection (TensorFlow) is set to gpu_options.allow_growth=True, so I'm not sure what the actual minimum memory requirements are.
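To compare a card against the figures mentioned above, the memory line can be pulled out of the TensorFlow log. This is only a rough sketch: the two constants are the usage observed on a GTX 1080 in this thread, not confirmed minimums (allow_growth means actual needs may be lower), the GB and GiB units are treated as interchangeable for a coarse comparison, and the helper names are made up for illustration. It also assumes the two stages run one after the other, as the log order suggests, so each is compared separately rather than summed.

```python
import re

# Usage observed on a GTX 1080 in this thread; observations, not minimums.
OBSERVED_RCNN_GB = 4.8
OBSERVED_TORCH_GB = 1.8


def parse_gpu_memory(log_text):
    """Extract (total_gib, free_gib) from a TensorFlow 'Found device' log entry."""
    match = re.search(
        r"totalMemory:\s*([\d.]+)GiB\s+freeMemory:\s*([\d.]+)GiB", log_text
    )
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def memory_headroom(free_gib, observed_gb):
    """Return free memory minus observed usage; a negative value is a shortfall."""
    return free_gib - observed_gb


total, free = parse_gpu_memory("totalMemory: 1.95GiB freeMemory: 1.64GiB")
print("Faster RCNN headroom:", memory_headroom(free, OBSERVED_RCNN_GB))
print("Torch headroom:", memory_headroom(free, OBSERVED_TORCH_GB))
```

For the GTX 750 in the first log both values come out negative, which is consistent with the repeated allocator warnings, although those warnings alone do not prove the run failed for that reason.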

zkailong commented 6 years ago

@sberryman Thanks for your reply. But the log says:

The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

So I don't think the GPU memory of my computer is too small to run AlphaPose. And thanks for your Dockerfile. Maybe I should rebuild my environment.
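One way to separate the warnings from the hard failures in a log like the one above is to count TensorFlow severities and tracebacks. The sketch below assumes TensorFlow's usual log prefix (timestamp, then a severity letter I, W, E or F) and treats a Lua "stack traceback:" line or a Python "Traceback (most recent call last)" line as a hard failure; `triage` is a hypothetical helper name.

```python
import re

# Matches e.g. "2018-03-06 17:32:55.388581: W tensorflow/..." anywhere in a line,
# because a progress bar may be printed in front of the timestamp.
TF_LINE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+: ([IWEF]) ")


def triage(log_lines):
    """Count TensorFlow log severities and Lua/Python tracebacks."""
    summary = {"I": 0, "W": 0, "E": 0, "F": 0, "tracebacks": 0}
    for line in log_lines:
        match = TF_LINE.search(line)
        if match:
            summary[match.group(1)] += 1
        elif "stack traceback:" in line or "Traceback (most recent call last)" in line:
            summary["tracebacks"] += 1
    return summary


sample = [
    "2018-03-06 17:32:49.022502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0",
    "0%| | 0/3 [00:00<?, ?it/s]2018-03-06 17:32:55.388581: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB.",
    "stack traceback:",
    "Traceback (most recent call last):",
]
print(triage(sample))
```

Applied to the first log, the out-of-memory messages all count as W (warnings), while the tracebacks come from the Lua step and the Python steps after it. The missing test-pose.h5 is likely a consequence of the Lua step stopping before it wrote any output.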

sberryman commented 6 years ago

Good luck. I know it took me a LONG time to figure out the right combination of dependencies. Hopefully the Dockerfile will point you in the right direction.

Fang-Haoshu commented 6 years ago

Thanks @sberryman for the Dockerfile! @zkailong From the log, it seems you have hit this problem: https://github.com/deepmind/torch-hdf5/issues/79. A possible solution is to install Torch with Lua 5.1.
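To see which Lua version a Torch install is actually using, the paths in the traceback are a quick hint. A small sketch, assuming the install follows the `share/lua/X.Y/` layout seen in the logs in this thread (`lua_versions_in_traceback` is a hypothetical helper):

```python
import re


def lua_versions_in_traceback(traceback_text):
    """Return the sorted, de-duplicated Lua versions found in 'share/lua/X.Y/' paths."""
    return sorted(set(re.findall(r"/share/lua/(\d+\.\d+)/", traceback_text)))


line = (
    "/home/name/torch/install/share/lua/5.2/trepl/init.lua:389: "
    "/home/name/torch/install/share/lua/5.2/hdf5/ffi.lua:56: "
    "expected align(#) on line 579"
)
print(lua_versions_in_traceback(line))
```

The first log in this thread points at Lua 5.2, which matches the suggestion to reinstall with Lua 5.1. After a reinstall, the same check on a fresh traceback shows whether the new interpreter is really the one being picked up.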

zkailong commented 6 years ago

@Fang-Haoshu Thanks for your reply. I reinstalled Torch with Lua 5.1, but it did not work...

Fang-Haoshu commented 6 years ago

Sooooo weird... In the deepmind issue, it seems many people also run into this problem.

zkailong commented 6 years ago

@Fang-Haoshu So frustrating... I have sent you an email. Maybe we can talk more about it there.

wangweihb commented 6 years ago

`zhanghua@zhanghua-System-Product-Name:~/AlphaPose$ ./run.sh --indir examples/demo/ --outdir examples/results/ --vis 0
generating bbox from Faster RCNN...
2018-04-16 15:48:19.729543: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2018-04-16 15:48:20.037014: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6575
pciBusID: 0000:65:00.0
totalMemory: 10.90GiB freeMemory: 10.44GiB
2018-04-16 15:48:20.037044: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-04-16 15:48:20.229660: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-16 15:48:20.229700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-04-16 15:48:20.229705: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-04-16 15:48:20.229898: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10102 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
Loaded network ../output/res152/coco_2014_train+coco_2014_valminusminival/default/res152.ckpt
/home/zhanghua/AlphaPose/examples/demo/
100%|█████████████████████████████████████████████| 3/3 [00:03<00:00, 1.12s/it]
pose estimation with RMPE...
Found Environment variable CUDNN_PATH = /usr/local/cuda/lib64/libcudnn.so.9.0:/usr/local/cuda-9.0/bin:/home/zhanghua/torch/install/bin:/home/zhanghua/bin:/home/zhanghua/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
/home/zhanghua/torch/install/bin/luajit: /home/zhanghua/torch/install/share/lua/5.1/trepl/init.lua:389: /home/zhanghua/torch/install/share/lua/5.1/trepl/init.lua:389: /home/zhanghua/torch/install/share/lua/5.1/cudnn/ffi.lua:1618: /usr/local/cuda/lib64/libcudnn.so.9.0:/usr/local/cuda-9.0/bin:/home/zhanghua/torch/install/bin:/home/zhanghua/bin:/home/zhanghua/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin: cannot open shared object file: No such file or directory
stack traceback:
    [C]: in function 'error'
    /home/zhanghua/torch/install/share/lua/5.1/trepl/init.lua:389: in function 'require'
    /home/zhanghua/AlphaPose/predict/util.lua:12: in main chunk
    [C]: in function 'dofile'
    main-alpha-pose.lua:7: in main chunk
    [C]: in function 'dofile'
    ...ghua/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    [C]: at 0x00405d50
Traceback (most recent call last):
  File "parametric-pose-nms-MPII.py", line 256, in <module>
    get_result_json(args)
  File "parametric-pose-nms-MPII.py", line 243, in get_result_json
    test_parametric_pose_NMS_json(delta1, delta2, mu, gamma,args.outputpath)
  File "parametric-pose-nms-MPII.py", line 99, in test_parametric_pose_NMS_json
    h5file = h5py.File(os.path.join(outputpath,"POSE/test-pose.h5"), 'r')
  File "/usr/lib/python2.7/dist-packages/h5py/_hl/files.py", line 272, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
  File "/usr/lib/python2.7/dist-packages/h5py/_hl/files.py", line 92, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/build/h5py-nQFNYZ/h5py-2.6.0/h5py/_objects.c:2577)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/build/h5py-nQFNYZ/h5py-2.6.0/h5py/_objects.c:2536)
  File "h5py/h5f.pyx", line 76, in h5py.h5f.open (/build/h5py-nQFNYZ/h5py-2.6.0/h5py/h5f.c:1811)
IOError: Unable to open file (Unable to open file: name = '/home/zhanghua/alphapose/examples/results/pose/test-pose.h5', errno = 2, error message = 'no such file or directory', flags = 0, o_flags = 0)
visualization...
Traceback (most recent call last):
  File "json-video.py", line 63, in <module>
    with open(jsonpath) as f:
IOError: [Errno 2] No such file or directory: '/home/zhanghua/AlphaPose/examples/results/POSE/alpha-pose-results-forvis.json'
`

This is my problem. Can anyone help me? Thanks.
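In the log above, CUDNN_PATH appears to hold the library file name followed by what looks like the rest of PATH, and the loader then reports that it cannot open that whole string. Assuming the Torch cudnn binding expects CUDNN_PATH to name a single shared library file (an assumption worth checking against the binding's README), a quick sanity check could look like the sketch below. `check_cudnn_path` is a hypothetical helper, and the ":" separator reflects the Linux setup in the log.

```python
import os


def check_cudnn_path(value):
    """Return a list of problems found with a CUDNN_PATH value (empty if none)."""
    problems = []
    if ":" in value:
        # A colon-separated value looks like a search path, not one library file.
        problems.append("value contains ':', so it looks like a search path")
    elif not os.path.isfile(value):
        problems.append("no file exists at %s" % value)
    return problems


# Shortened form of the value printed in the log above.
logged = "/usr/local/cuda/lib64/libcudnn.so.9.0:/usr/local/cuda-9.0/bin"
print(check_cudnn_path(logged))
print(check_cudnn_path(os.environ.get("CUDNN_PATH", "")))
```

If the first problem is reported, the variable was probably exported with `:$PATH` appended by mistake. Setting it to just the library file, and confirming that file exists, would be the next thing to try.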

MVIG-SJTU / AlphaPose

A problem in running run.sh #29
