Open · bylowerik opened 5 years ago
Did you check the PATH? The bashrc file?
In the docker image the path is:
/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
What should it look like?
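Not an authoritative answer, but a quick way to compare PATH values is to print them one directory per line; a minimal sketch using the path string quoted above:

```shell
# PATH reported inside the docker image, one entry per line.
path_in_image="/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
echo "$path_in_image" | tr ':' '\n'
```

The CUDA toolchain entries (/usr/local/cuda/bin and /usr/local/nvidia/bin) are the ones worth checking first, since nvcc and the driver tools live there.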
From this thread it is clear that many have problems with the new RTX 2080 card. I have done further investigation and found that one might need to update the Dockerfile. Here is a Dockerfile for building caffe2, and it contains the following lines:

-DCUDA_ARCH_BIN="35 52 60 61" \
-DCUDA_ARCH_PTX="61" \

The RTX 2080 has compute capability 7.5, so I updated those lines to:

-DCUDA_ARCH_BIN="35 52 60 61 75" \
-DCUDA_ARCH_PTX="75" \
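As a sanity check, the updated flags can be composed in a small shell sketch (the values are the ones above; "75" is the Turing SM version, i.e. compute capability 7.5, used by the RTX 2080):

```shell
# Compose the updated CMake CUDA arch flags; "75" = Turing (RTX 2080).
archs="35 52 60 61 75"
ptx="75"
echo "-DCUDA_ARCH_BIN=\"$archs\" -DCUDA_ARCH_PTX=\"$ptx\""
```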
I then copy-pasted the entire Dockerfile for caffe2 into the Dockerfile for DensePose and deleted the top row

FROM caffe2/caffe2:snapshot-py2-cuda9.0-cudnn7-ubuntu16.04

in the DensePose Dockerfile. To make it build I needed to append typing to the pip installations. Also, when DensePose starts to build, it searches for caffe2 in /usr/local/caffe2. That folder is empty, so I changed

RUN mv /usr/local/caffe2 /usr/local/caffe2_build

to

RUN mv /pytorch/caffe2 /usr/local/caffe2_build
However, I am a bit suspicious, since when building caffe2 the build directory is removed after compilation, yet in DensePose's Dockerfile the directory is called /usr/local/caffe2_build. It is not clear how one should change this part. Nevertheless, the final docker image builds and the following command works:

nvidia-docker run --rm -it densepose:c2-cuda9-cudnn7 python2 detectron/tests/test_batch_permutation_op.py

Running inference, though, does not work:

python2 tools/infer_simple.py --cfg configs/DensePose_ResNet101_FPN_s1x-e2e.yaml --output-dir DensePoseData/infer_out/ --image-ext jpg --wts https://dl.fbaipublicfiles.com/densepose/DensePose_ResNet101_FPN_s1x-e2e.pkl DensePoseData/demo_data/demo_im.jpg
Results in:

File "/usr/local/lib/python2.7/dist-packages/caffe2/python/workspace.py", line 236, in RunNet
    StringifyNetName(name), num_iter, allow_fail,
File "/usr/local/lib/python2.7/dist-packages/caffe2/python/workspace.py", line 197, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: [enforce fail at conv_op_cudnn.cc:807] status == CUDNN_STATUS_SUCCESS. 8 vs 0. , Error at: /pytorch/caffe2/operators/conv_op_cudnn.cc:807: CUDNN_STATUS_EXECUTION_FAILED
Error from operator: input: "gpu_0/data" input: "gpu_0/conv1_w" output: "gpu_0/conv1" name: "" type: "Conv" arg { name: "kernel" i: 7 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 3 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 2 } device_option { device_type: 1 device_id: 0 } engine: "CUDNN"
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void const*) + 0x68 (0x7f27103cf7e8 in /usr/local/lib/libc10.so)
frame #1: <unknown function> + 0x103080e (0x7f268c7e880e in /usr/local/lib/libcaffe2_gpu.so)
frame #2: bool caffe2::CudnnConvOp::DoRunWithType<float, float, float, float>() + 0x85e (0x7f268c7effee in /usr/local/lib/libcaffe2_gpu.so)
frame #3: caffe2::CudnnConvOp::RunOnDevice() + 0xa0 (0x7f268c7e1fb0 in /usr/local/lib/libcaffe2_gpu.so)
frame #4: <unknown function> + 0xf70245 (0x7f268c728245 in /usr/local/lib/libcaffe2_gpu.so)
frame #5: caffe2::AsyncNetBase::run(int, int) + 0x154 (0x7f26f1a8ef44 in /usr/local/lib/libcaffe2.so)
frame #6: <unknown function> + 0x1338d25 (0x7f26f1a8ad25 in /usr/local/lib/libcaffe2.so)
frame #7: c10::ThreadPool::main_loop(unsigned long) + 0x2eb (0x7f26f0b4822b in /usr/local/lib/libcaffe2.so)
frame #8: <unknown function> + 0xb8c80 (0x7f2715b16c80 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #9: <unknown function> + 0x76ba (0x7f271bfbc6ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #10: clone + 0x6d (0x7f271bcf241d in /lib/x86_64-linux-gnu/libc.so.6)
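When the enforce message is this long, it can help to pull out just the failing operator definition before reading anything else. A small shell sketch over a shortened copy of the "Error from operator" line above:

```shell
# Shortened copy of the "Error from operator" line from the trace above.
log='Error from operator: input: "gpu_0/data" input: "gpu_0/conv1_w" output: "gpu_0/conv1" name: "" type: "Conv" engine: "CUDNN"'

# Extract the operator type and its inputs, one per line.
echo "$log" | grep -o 'type: "[A-Za-z]*"'
echo "$log" | grep -o 'input: "[^"]*"'
```

Here it shows the very first Conv of the network (conv1) failing, which points at a kernel/architecture mismatch rather than a model-specific problem.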
It seems like it does not find everything? In particular, I do not understand this:

Error at: /pytorch/caffe2/operators/conv_op_cudnn.cc:807

While building the docker image, caffe2 is moved from pytorch to caffe2_build, so why is the error under pytorch? Any ideas?
Same error on an RTX 2080. Hope someone can help.
I managed to run densepose on an RTX 2080ti using the following Dockerfile. I hope this helps!
FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04
RUN apt-get -y update
RUN apt-get install -y --no-install-recommends \
build-essential \
git \
libgoogle-glog-dev \
libgtest-dev \
libiomp-dev \
libleveldb-dev \
liblmdb-dev \
libopencv-dev \
libopenmpi-dev \
libsnappy-dev \
libprotobuf-dev \
openmpi-bin \
openmpi-doc \
protobuf-compiler \
python-dev \
python-pip
RUN pip install setuptools
RUN pip install --user \
future \
numpy \
protobuf \
typing \
hypothesis
RUN apt-get install -y --no-install-recommends \
libgflags-dev \
cmake
RUN git clone --branch master --recursive https://github.com/pytorch/pytorch.git
RUN pip install typing pyyaml
WORKDIR /pytorch
RUN git submodule update --init --recursive
RUN python setup.py install
RUN git clone https://github.com/facebookresearch/densepose /densepose
# Install Python dependencies
RUN pip install -U pip
RUN pip install -r /densepose/requirements.txt
# Install the COCO API
RUN git clone https://github.com/cocodataset/cocoapi.git /cocoapi
WORKDIR /cocoapi/PythonAPI
ENV PYTHONPATH /usr/local
ENV Caffe2_DIR=/usr/local/lib/python2.7/dist-packages/torch/share/cmake/Caffe2/
ENV PYTHONPATH=${PYTHONPATH}:/pytorch/build
ENV LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH}
ENV LD_LIBRARY_PATH=/usr/local/lib/python2.7/dist-packages/torch/lib/:${LD_LIBRARY_PATH}
ENV LIBRARY_PATH=/usr/local/lib/python2.7/dist-packages/torch/lib/:${LIBRARY_PATH}
ENV C_INCLUDE_PATH=/usr/local/lib/python2.7/dist-packages/torch/lib/include/:${C_INCLUDE_PATH}
ENV CPLUS_INCLUDE_PATH=/usr/local/lib/python2.7/dist-packages/torch/lib/include/:${CPLUS_INCLUDE_PATH}
ENV C_INCLUDE_PATH=/pytorch/:${C_INCLUDE_PATH}
ENV CPLUS_INCLUDE_PATH=/pytorch/:${CPLUS_INCLUDE_PATH}
ENV C_INCLUDE_PATH=/pytorch/build/:${C_INCLUDE_PATH}
ENV CPLUS_INCLUDE_PATH=/pytorch/build/:${CPLUS_INCLUDE_PATH}
ENV C_INCLUDE_PATH=/pytorch/torch/lib/include/:${C_INCLUDE_PATH}
ENV CPLUS_INCLUDE_PATH=/pytorch/torch/lib/include/:${CPLUS_INCLUDE_PATH}
RUN make install
WORKDIR /densepose
RUN make
RUN make ops
RUN apt-get -y update \
&& apt-get -y install \
wget \
software-properties-common
WORKDIR /densepose/DensePoseData
RUN bash get_densepose_uv.sh
WORKDIR /densepose
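A possible tweak to the pytorch build step in the Dockerfile above (an assumption on my side: that this checkout honors the TORCH_CUDA_ARCH_LIST environment variable, which upstream pytorch's build reads): restricting the target architectures to Turing should shorten the compile considerably.

```dockerfile
# Hypothetical variant of the pytorch build step: only build sm_75 kernels.
ENV TORCH_CUDA_ARCH_LIST="7.5"
RUN python setup.py install
```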
@elkaps yes it helps, thanks!
Though I had to remove the RUN make ops line to be able to build your Dockerfile.
Any ideas why this is working compared to the facebookresearch/DensePose Dockerfile? The main difference I see is that you are compiling pytorch from source instead of using conda.
@johannabar Yes, the Dockerfile uses CUDA 10 instead of CUDA 8 and compiles pytorch from source using setup.py.
RUN pip install setuptools
RUN pip install --user \
    future \
    numpy==1.14.0 \
    protobuf==3.5.1 \
    typing \
    hypothesis

The build failed until specific versions were provided.
For me, with Ubuntu 18.04, a 2080 Ti and CUDA 10, the following pinned pip versions were needed for it to work: pyYAML==3.12, numpy==1.14.0, protobuf==3.11.1.
"For me with ubuntu 18.04, 2080ti and cuda 10 the following pip install versions were needed for it to work: pyYAML==3.12 numpy==1.14.0 protobuf==3.11.1"

Hey @joel-simon, I'm on Ubuntu 18.04 with an RTX 2080 and my Dockerfile still fails at make ops. Can you share your Dockerfile, or how you installed?