facebookresearch / DensePose

A real-time approach for mapping all human pixels of 2D RGB images to a 3D surface-based model of the body
http://densepose.org

Further investigation with RTX2080 #184

Open bylowerik opened 5 years ago

bylowerik commented 5 years ago

From this thread it is clear that many have problems with the new RTX 2080 card. I have done further investigation and found that one may need to update the Dockerfile.

Here is a Dockerfile for building Caffe2, and it contains the following lines:

-DCUDA_ARCH_BIN="35 52 60 61" \
-DCUDA_ARCH_PTX="61" \

The RTX 2080 has compute capability 7.5, so I updated those lines to

-DCUDA_ARCH_BIN="35 52 60 61 75" \
-DCUDA_ARCH_PTX="75" \
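
To double-check which value a given card needs, the compute capability can be read programmatically. Here is a minimal sketch, assuming a CUDA-enabled PyTorch is importable (which may not be the case inside the plain Caffe2 image):

import torch

# Prints the device name and compute capability; 7.5 corresponds to the "75" entries above.
major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0))
print("compute capability: %d.%d" % (major, minor))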

I then copy-pasted the entire Caffe2 Dockerfile into the DensePose Dockerfile and deleted the first line, FROM caffe2/caffe2:snapshot-py2-cuda9.0-cudnn7-ubuntu16.04, from the DensePose Dockerfile.

To make it build I needed to append typing to the pip installations. Also, when DensePose starts to build, it looks for Caffe2 in /usr/local/caffe2. That folder is empty, so I changed

 RUN mv /usr/local/caffe2 /usr/local/caffe2_build

to

 RUN mv /pytorch/caffe2 /usr/local/caffe2_build

However, I am a bit suspicious: when Caffe2 is built, the build directory is removed after compilation, yet DensePose's Dockerfile expects a directory at /usr/local/caffe2_build. It is not clear how this part should be changed.

Nevertheless, the final Docker image builds, and the following command works:

nvidia-docker run --rm -it densepose:c2-cuda9-cudnn7 python2 detectron/tests/test_batch_permutation_op.py

Running inference, however, does not work:

python2 tools/infer_simple.py \
    --cfg configs/DensePose_ResNet101_FPN_s1x-e2e.yaml \
    --output-dir DensePoseData/infer_out/ \
    --image-ext jpg \
    --wts https://dl.fbaipublicfiles.com/densepose/DensePose_ResNet101_FPN_s1x-e2e.pkl \
    DensePoseData/demo_data/demo_im.jpg

Results in

  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/workspace.py", line 236, in RunNet
    StringifyNetName(name), num_iter, allow_fail,
  File "/usr/local/lib/python2.7/dist-packages/caffe2/python/workspace.py", line 197, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: [enforce fail at conv_op_cudnn.cc:807] status == CUDNN_STATUS_SUCCESS. 8 vs 0. , Error at: /pytorch/caffe2/operators/conv_op_cudnn.cc:807: CUDNN_STATUS_EXECUTION_FAILED
Error from operator:
input: "gpu_0/data" input: "gpu_0/conv1_w" output: "gpu_0/conv1" name: "" type: "Conv" arg { name: "kernel" i: 7 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 3 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 2 } device_option { device_type: 1 device_id: 0 } engine: "CUDNN"
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void const*) + 0x68 (0x7f27103cf7e8 in /usr/local/lib/libc10.so)
frame #1: <unknown function> + 0x103080e (0x7f268c7e880e in /usr/local/lib/libcaffe2_gpu.so)
frame #2: bool caffe2::CudnnConvOp::DoRunWithType<float, float, float, float>() + 0x85e (0x7f268c7effee in /usr/local/lib/libcaffe2_gpu.so)
frame #3: caffe2::CudnnConvOp::RunOnDevice() + 0xa0 (0x7f268c7e1fb0 in /usr/local/lib/libcaffe2_gpu.so)
frame #4: <unknown function> + 0xf70245 (0x7f268c728245 in /usr/local/lib/libcaffe2_gpu.so)
frame #5: caffe2::AsyncNetBase::run(int, int) + 0x154 (0x7f26f1a8ef44 in /usr/local/lib/libcaffe2.so)
frame #6: <unknown function> + 0x1338d25 (0x7f26f1a8ad25 in /usr/local/lib/libcaffe2.so)
frame #7: c10::ThreadPool::main_loop(unsigned long) + 0x2eb (0x7f26f0b4822b in /usr/local/lib/libcaffe2.so)
frame #8: <unknown function> + 0xb8c80 (0x7f2715b16c80 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #9: <unknown function> + 0x76ba (0x7f271bfbc6ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #10: clone + 0x6d (0x7f271bcf241d in /lib/x86_64-linux-gnu/libc.so.6)

It seems like it does not find everything? In particular, I do not understand this line:

Error at: /pytorch/caffe2/operators/conv_op_cudnn.cc:807: 

While building the Docker image, caffe2 is moved from /pytorch to /usr/local/caffe2_build, so why does the error point to a path under /pytorch?
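
One way to narrow this down might be to run the failing operator in isolation. Below is a minimal sketch (blob names X, W, Y and the input shape are arbitrary) that feeds random data through a single Conv op with the same arguments and CUDNN engine as the operator in the traceback, using only the standard Caffe2 Python API. If this already fails, the problem is a general Caffe2/cuDNN/compute-capability mismatch rather than anything DensePose-specific:

import numpy as np
from caffe2.proto import caffe2_pb2
from caffe2.python import core, workspace

gpu = core.DeviceOption(caffe2_pb2.CUDA, 0)
workspace.FeedBlob("X", np.random.rand(1, 3, 224, 224).astype(np.float32), device_option=gpu)
workspace.FeedBlob("W", np.random.rand(64, 3, 7, 7).astype(np.float32), device_option=gpu)

# Same arguments as the failing conv1: 7x7 kernel, pad 3, stride 2, NCHW, cuDNN engine
op = core.CreateOperator(
    "Conv", ["X", "W"], ["Y"],
    kernel=7, pad=3, stride=2, order="NCHW", engine="CUDNN",
    device_option=gpu,
)
workspace.RunOperatorOnce(op)
print(workspace.FetchBlob("Y").shape)  # (1, 64, 112, 112) if the op runs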

Any ideas?

torpor29 commented 5 years ago

Did you check the PATH? And your .bashrc file?

bylowerik commented 5 years ago

In the Docker image the PATH is:

/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

What should it look like?

clw5180 commented 5 years ago

(quotes bylowerik's original post above in full)

Same error on an RTX 2080. I hope someone can help.

elkaps commented 5 years ago

I managed to run DensePose on an RTX 2080 Ti using the following Dockerfile. I hope this helps!

FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04

RUN apt-get -y update
RUN apt-get install -y --no-install-recommends \
      build-essential \
      git \
      libgoogle-glog-dev \
      libgtest-dev \
      libiomp-dev \
      libleveldb-dev \
      liblmdb-dev \
      libopencv-dev \
      libopenmpi-dev \
      libsnappy-dev \
      libprotobuf-dev \
      openmpi-bin \
      openmpi-doc \
      protobuf-compiler \
      python-dev \
      python-pip   
RUN pip install setuptools                       
RUN pip install --user \
      future \
      numpy \
      protobuf \
      typing \
      hypothesis
RUN apt-get install -y --no-install-recommends \
      libgflags-dev \
      cmake

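# Build PyTorch (and its bundled Caffe2) from source against the CUDA 10 base image; CUDA 10 adds support for Turing (sm_75) GPUs such as the RTX 2080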
RUN git clone --branch master --recursive https://github.com/pytorch/pytorch.git
RUN pip install typing pyyaml
WORKDIR /pytorch
RUN git submodule update --init --recursive
RUN python setup.py install

RUN git clone https://github.com/facebookresearch/densepose /densepose

# Install Python dependencies
RUN pip install -U pip
RUN pip install -r /densepose/requirements.txt

# Install the COCO API
RUN git clone https://github.com/cocodataset/cocoapi.git /cocoapi
WORKDIR /cocoapi/PythonAPI

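# Point the DensePose build at the Caffe2 CMake config, libraries and headers produced by the PyTorch install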
ENV PYTHONPATH /usr/local
ENV Caffe2_DIR=/usr/local/lib/python2.7/dist-packages/torch/share/cmake/Caffe2/
ENV PYTHONPATH=${PYTHONPATH}:/pytorch/build
ENV LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH}

ENV LD_LIBRARY_PATH=/usr/local/lib/python2.7/dist-packages/torch/lib/:${LD_LIBRARY_PATH}
ENV LIBRARY_PATH=/usr/local/lib/python2.7/dist-packages/torch/lib/:${LIBRARY_PATH}

ENV C_INCLUDE_PATH=/usr/local/lib/python2.7/dist-packages/torch/lib/include/:${C_INCLUDE_PATH}
ENV CPLUS_INCLUDE_PATH=/usr/local/lib/python2.7/dist-packages/torch/lib/include/:${CPLUS_INCLUDE_PATH}

ENV C_INCLUDE_PATH=/pytorch/:${C_INCLUDE_PATH}
ENV CPLUS_INCLUDE_PATH=/pytorch/:${CPLUS_INCLUDE_PATH}

ENV C_INCLUDE_PATH=/pytorch/build/:${C_INCLUDE_PATH}
ENV CPLUS_INCLUDE_PATH=/pytorch/build/:${CPLUS_INCLUDE_PATH}

ENV C_INCLUDE_PATH=/pytorch/torch/lib/include/:${C_INCLUDE_PATH}
ENV CPLUS_INCLUDE_PATH=/pytorch/torch/lib/include/:${CPLUS_INCLUDE_PATH}

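# Build and install the COCO API (the working directory is still /cocoapi/PythonAPI)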
RUN make install
WORKDIR /densepose

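# Build DensePose and its custom Caffe2 operators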
RUN make
RUN make ops

RUN apt-get -y update \
    && apt-get -y install \
        wget \
        software-properties-common

WORKDIR /densepose/DensePoseData
RUN bash get_densepose_uv.sh

WORKDIR /densepose

johannabar commented 5 years ago

@elkaps yes it helps, thanks!

Though I had to remove the RUN make ops step to be able to build your Dockerfile.

Any ideas why this works compared to the facebookresearch/DensePose Dockerfile? The main difference I see is that you compile PyTorch from source instead of using conda.

elkaps commented 5 years ago

@johannabar Yes, the Dockerfile uses CUDA 10 instead of CUDA 8 and compiles PyTorch from source using setup.py.

yuyou commented 4 years ago

RUN pip install setuptools
RUN pip install --user \
      future \
      numpy==1.14.0 \
      protobuf==3.5.1 \
      typing \
      hypothesis

The build failed until these specific versions were provided.

joel-simon commented 4 years ago

For me, with Ubuntu 18.04, a 2080 Ti and CUDA 10, the following pip package versions were needed for it to work: pyYAML==3.12, numpy==1.14.0, protobuf==3.11.1.
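
A quick way to confirm inside the image that these pins actually took effect (a small sketch; only the import names for the packages mentioned above):

import numpy
import yaml
from google import protobuf

# Print the installed versions of the pinned packages.
print("numpy " + numpy.__version__)        # expect 1.14.0
print("PyYAML " + yaml.__version__)        # expect 3.12
print("protobuf " + protobuf.__version__)  # expect 3.11.1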

getarobo commented 4 years ago

(quotes @joel-simon's comment above)

Hey @joel-simon, I'm on Ubuntu 18.04 with an RTX 2080 and my Dockerfile still fails at "make ops". Can you share your Dockerfile or describe how you installed it?