facebookresearch / maskrcnn-benchmark

Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch.

Docker Runtime Error: Not Compiled with GPU support #167

Open archdyn opened 5 years ago

archdyn commented 5 years ago

❓ Questions and Help

Hello,

I have a strange problem with the Docker image. I build the Docker image following the instructions in INSTALL.md, but when I then try to train on the coco2014 dataset with the command below, I get RuntimeError: Not compiled with GPU support (nms at ./maskrcnn_benchmark/csrc/nms.h:22)

nvidia-docker run --shm-size=8gb -v /home/archdyn/Datasets/coco:/maskrcnn-benchmark/datasets/coco maskrcnn-benchmark python /maskrcnn-benchmark/tools/train_net.py --config-file "/maskrcnn-benchmark/configs/e2e_mask_rcnn_R_50_FPN_1x.yaml" SOLVER.IMS_PER_BATCH 1 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1

But when I change the Dockerfile to comment out the line python setup.py build develop (before WORKDIR /maskrcnn-benchmark) and instead run python setup.py build develop inside the built docker container, I can train without problems.

My Environment when running the Docker Container:

2018-11-17 20:03:13,889 maskrcnn_benchmark INFO: Collecting env info (might take some time)
2018-11-17 20:03:15,634 maskrcnn_benchmark INFO: 
PyTorch version: 1.0.0.dev20181116
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.5 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration: GPU 0: GeForce GTX 850M
Nvidia driver version: 410.73
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.4.1
/usr/lib/x86_64-linux-gnu/libcudnn_static_v7.a

Versions of relevant libraries:
[pip] Could not collect
[conda] pytorch-nightly           1.0.0.dev20181116 py3.6_cuda9.0.176_cudnn7.1.2_0    pytorch
        Pillow (5.3.0)

Does somebody know why this problem happens?

fmassa commented 5 years ago

cc @miguelvr do you know what this might be?

miguelvr commented 5 years ago

@archdyn can you paste your build command?

miguelvr commented 5 years ago

Ahhh, I see that your GPU is a GTX 850M... I'm not sure if the pytorch 1.0 wheel is compatible with that GPU. @fmassa might be able to confirm that
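
One quick way to check the card's compute capability from PyTorch (a rough check; it assumes the runtime can see the GPU at all) is:

python -c "import torch; print(torch.cuda.get_device_capability(0))"

The GTX 850M is a Maxwell card (compute capability 5.0).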

fmassa commented 5 years ago

It should be fine I think.

Also, the PyTorch installation seems to be fine, but the error appears when he tries to install maskrcnn-benchmark.

miguelvr commented 5 years ago

Well, now I'm running the exact same docker file as before and I'm getting another runtime error:

Traceback (most recent call last):
  File "tools/train_net.py", line 16, in <module>
    from maskrcnn_benchmark.engine.inference import inference
  File "/maskrcnn-benchmark/maskrcnn_benchmark/engine/inference.py", line 20, in <module>
    from maskrcnn_benchmark.structures.boxlist_ops import boxlist_iou
  File "/maskrcnn-benchmark/maskrcnn_benchmark/structures/boxlist_ops.py", line 6, in <module>
    from maskrcnn_benchmark.layers import nms as _box_nms
  File "/maskrcnn-benchmark/maskrcnn_benchmark/layers/__init__.py", line 8, in <module>
    from .nms import nms
  File "/maskrcnn-benchmark/maskrcnn_benchmark/layers/nms.py", line 3, in <module>
    from maskrcnn_benchmark import _C
ImportError: /maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN2at4cuda6detail17CUDAStream_streamEP19CUDAStreamInternals

The Dockerfile installs the libraries from their respective GitHub master branches... Maybe something has changed since then?

miguelvr commented 5 years ago

Never mind the error above, I was mapping the csrc/ folder by mistake...

It is working fine for me now...

miguelvr commented 5 years ago

@archdyn what is your local CUDA version?

Please run nvcc --version

and check whether it matches the version used in the docker image... these have to match.
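
For example, a rough way to compare the two (assuming the image is tagged maskrcnn-benchmark and nvcc is on the PATH in both places):

# CUDA toolkit version on the host
nvcc --version | grep release
# CUDA toolkit version inside the image
nvidia-docker run --rm maskrcnn-benchmark nvcc --version | grep release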

archdyn commented 5 years ago

I have a second PC at my disposal where this problem also occurs. The second PC has an NVIDIA V100. Its output is below.

nvcc -V:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85

nvidia-smi:

Mon Nov 19 13:22:50 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:60:00.0 Off |                  Off |
| N/A   32C    P0    37W / 250W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

command to build docker image:

nvidia-docker build -t maskrcnn-benchmark --build-arg CUDA=9.1 --build-arg CUDNN=7 docker/

PS: The argument order of the build command in INSTALL.md seems to be wrong. INSTALL.md gives the following order, but it produces an error:

nvidia-docker build -t --build-arg CUDA=9.2 --build-arg CUDNN=7 maskrcnn-benchmark docker/

When building maskrcnn-benchmark, the following warning occurs, but the build process continues and succeeds: No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'

But when I then train with the following command, I get the runtime error:

nvidia-docker run --shm-size=8gb -v /home/archdyn/Datasets/coco:/maskrcnn-benchmark/datasets/coco maskrcnn-benchmark python /maskrcnn-benchmark/tools/train_net.py --config-file "/maskrcnn-benchmark/configs/e2e_mask_rcnn_R_50_FPN_1x.yaml" SOLVER.IMS_PER_BATCH 1 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1

The only way I found to train and avoid the runtime error is to modify the Dockerfile as follows and build it:

ARG CUDA="9.0"
ARG CUDNN="7"

FROM nvidia/cuda:${CUDA}-cudnn${CUDNN}-devel-ubuntu16.04

RUN echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections

# install basics
RUN apt-get update -y \
 && apt-get install -y apt-utils git curl ca-certificates bzip2 cmake tree htop bmon iotop g++

# Install Miniconda
RUN curl -so /miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh \
 && chmod +x /miniconda.sh \
 && /miniconda.sh -b -p /miniconda \
 && rm /miniconda.sh

ENV PATH=/miniconda/bin:$PATH

# Create a Python 3.6 environment
RUN /miniconda/bin/conda install -y conda-build \
 && /miniconda/bin/conda create -y --name py36 python=3.6.7 \
 && /miniconda/bin/conda clean -ya

ENV CONDA_DEFAULT_ENV=py36
ENV CONDA_PREFIX=/miniconda/envs/$CONDA_DEFAULT_ENV
ENV PATH=$CONDA_PREFIX/bin:$PATH
ENV CONDA_AUTO_UPDATE_CONDA=false

RUN conda install -y ipython
RUN pip install ninja yacs cython matplotlib

# Install PyTorch 1.0 Nightly
RUN conda install -y pytorch-nightly -c pytorch && conda clean -ya

# Install TorchVision master
RUN git clone https://github.com/pytorch/vision.git \
 && cd vision \
 && python setup.py install

# install pycocotools
RUN git clone https://github.com/cocodataset/cocoapi.git \
 && cd cocoapi/PythonAPI \
 && python setup.py build_ext install

# install PyTorch Detection
RUN git clone https://github.com/facebookresearch/maskrcnn-benchmark.git

WORKDIR /maskrcnn-benchmark

Then after the build I have to go inside the docker container:

nvidia-docker run --rm -it maskrcnn-benchmark bash

And inside the docker container I build maskrcnn-benchmark without problems:

python setup.py build develop

I then have to commit this modified docker container so that I have a Docker Image that can always be started:

docker commit [Container ID] maskrcnn-benchmark:working

After all these steps I can train without problems with:

nvidia-docker run --shm-size=8gb -v /home/archdyn/Datasets/coco:/maskrcnn-benchmark/datasets/coco maskrcnn-benchmark:working python /maskrcnn-benchmark/tools/train_net.py --config-file "/maskrcnn-benchmark/configs/e2e_mask_rcnn_R_50_FPN_1x.yaml" SOLVER.IMS_PER_BATCH 1 SOLVER.BASE_LR 0.0025 SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" TEST.IMS_PER_BATCH 1
miguelvr commented 5 years ago

The only possible difference that I see is that you are using CUDA 9.1... For some reason nvidia-docker is not detecting your CUDA installation...

Is CUDA_HOME set?
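
For example, a quick check of what the build would see (torch.utils.cpp_extension.CUDA_HOME should be the value that the CUDA_HOME check in setup.py looks at):

echo "CUDA_HOME=${CUDA_HOME:-<not set>}"
python -c "from torch.utils.cpp_extension import CUDA_HOME; print(CUDA_HOME)"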

archdyn commented 5 years ago

Setting CUDA_HOME locally and inside the Dockerfile didn't help. Maybe the runtime error occurs because my local CUDA is 9.1 and the Docker CUDA is 9.1, but conda install pytorch-nightly -c pytorch uses CUDA 9.0. It seems that the PyTorch installation from conda comes with its own CUDA.

https://discuss.pytorch.org/t/is-it-possible-to-link-conda-binary-to-cuda-in-non-standard-location/6863 https://discuss.pytorch.org/t/pytorch-and-cuda-9-1/13126/14

It seems that the build_anaconda.sh script for PyTorch supports the versions 8.0, 9.0 and 9.1: https://github.com/pytorch/pytorch/blob/master/scripts/build_anaconda.sh

I will try tomorrow to change the line conda install pytorch-nightly -c pytorch to conda install pytorch-nightly cuda91 -c pytorch and see what happens.

Edit: I changed the PyTorch installation today to conda install pytorch-nightly cuda91 -c pytorch, so my local CUDA is 9.1, the Docker CUDA is 9.1, and the PyTorch installation should be built with CUDA 9.1 as well. I also set CUDA_HOME locally and inside the Docker container. I am still getting the runtime error, and I don't know why nvidia-docker doesn't find my CUDA runtime. Since it works when I build maskrcnn-benchmark inside my Docker container and then commit that container, it doesn't matter that much.
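
For what it's worth, a quick way to check inside the container which CUDA version the installed PyTorch package was built against, and whether it can see the GPU:

python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"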

zimenglan-sysu-512 commented 5 years ago

Hi @archdyn, I met the same problem as you did: RuntimeError: Not compiled with GPU support (nms at /algo_code/maskrcnn_benchmark/csrc/nms.h:22). How did you solve it? By the way, I use docker instead of nvidia-docker. Thanks.

fmassa commented 5 years ago

@zimenglan-sysu-512 the Dockerfile that we provide only works with nvidia-docker; it won't work with plain docker.

zimenglan-sysu-512 commented 5 years ago

Hi @fmassa, I use nvidia-docker to build the image and then nvidia-docker to run it, and it still hits this problem. I found that when running the command sudo python3.6 setup.py build develop, torch.cuda.is_available() is False. Do you have any idea how to solve it? Thanks.

archdyn commented 5 years ago

You could try what worked for me. Take the following Dockerfile and build it with nvidia-docker. The Dockerfile just has the two lines removed under "install PyTorch Detection".

ARG CUDA="9.0"
ARG CUDNN="7"

FROM nvidia/cuda:${CUDA}-cudnn${CUDNN}-devel-ubuntu16.04

RUN echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections

# install basics
RUN apt-get update -y \
 && apt-get install -y apt-utils git curl ca-certificates bzip2 cmake tree htop bmon iotop g++

# Install Miniconda
RUN curl -so /miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh \
 && chmod +x /miniconda.sh \
 && /miniconda.sh -b -p /miniconda \
 && rm /miniconda.sh

ENV PATH=/miniconda/bin:$PATH

# Create a Python 3.6 environment
RUN /miniconda/bin/conda install -y conda-build \
 && /miniconda/bin/conda create -y --name py36 python=3.6.7 \
 && /miniconda/bin/conda clean -ya

ENV CONDA_DEFAULT_ENV=py36
ENV CONDA_PREFIX=/miniconda/envs/$CONDA_DEFAULT_ENV
ENV PATH=$CONDA_PREFIX/bin:$PATH
ENV CONDA_AUTO_UPDATE_CONDA=false

RUN conda install -y ipython
RUN pip install ninja yacs cython matplotlib

# Install PyTorch 1.0 Nightly
RUN conda install -y pytorch-nightly -c pytorch && conda clean -ya

# Install TorchVision master
RUN git clone https://github.com/pytorch/vision.git \
 && cd vision \
 && python setup.py install

# install pycocotools
RUN git clone https://github.com/cocodataset/cocoapi.git \
 && cd cocoapi/PythonAPI \
 && python setup.py build_ext install

# install PyTorch Detection
RUN git clone https://github.com/facebookresearch/maskrcnn-benchmark.git

WORKDIR /maskrcnn-benchmark

After that:

1. Go into the docker container: nvidia-docker run --rm -it maskrcnn-benchmark bash
2. Inside the docker container, run: python setup.py build develop
3. Exit the container without stopping it by pressing CTRL + p and then CTRL + q (alternatively, just open a new console).
4. From that console, run: docker commit <CONTAINER ID> maskrcnn-benchmark

After all this you should have a working docker image without this runtime error. At least for me this worked.

miguelvr commented 5 years ago

@archdyn that's super weird because you are just installing maskrcnn later...

miguelvr commented 5 years ago

@fmassa I noticed this line in the dockerfile: echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections

I didn't add it... Can you explain its purpose?

miguelvr commented 5 years ago

@archdyn @zimenglan-sysu-512 please make sure you have the correct versions of docker-ce and nvidia-docker installed: https://github.com/NVIDIA/nvidia-docker/wiki/Frequently-Asked-Questions#which-docker-packages-are-supported
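
On Ubuntu, something like the following should show which packages are installed:

docker --version
dpkg -l | grep -E 'docker-ce|nvidia-docker'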

zimenglan-sysu-512 commented 5 years ago

Hi @archdyn, I find that if I go into the built image and then run python3.6 setup.py build develop, it successfully builds maskrcnn-benchmark. It is so weird.

Hi @miguelvr, I use cuda-9.2 + cudnn-v7.4 + nvidia-driver-396, and nvidia-docker 18.09.

miguelvr commented 5 years ago

Hi @miguelvr, I use cuda-9.2 + cudnn-v7.4 + nvidia-driver-396, and nvidia-docker 18.09.

I think this is only happening to people who are using CUDA versions different from the defaults in the Dockerfile... I wasn't able to test with those, as I don't have them installed on my machine.

Anyway, can you run this nvidia-docker run -it nvidia/cuda:${CUDA}-cudnn${CUDNN}-devel-ubuntu16.04 nvidia-smi (replace ${CUDA} and ${CUDNN} with the build args that you used)

and check if it returns all your GPU information

zimenglan-sysu-512 commented 5 years ago

Hi @miguelvr, I ran the command and it outputs the following:

sudo nvidia-docker run -it nvidia/cuda:9.2-cudnn7-devel-ubuntu16.04 nvidia-smi                                                             
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.37                 Driver Version: 396.37                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 00000000:04:00.0 Off |                  N/A |
| 22%   45C    P0    72W / 250W |      0MiB / 12212MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 00000000:05:00.0 Off |                  N/A |
| 22%   53C    P0    77W / 250W |      0MiB / 12212MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  Off  | 00000000:08:00.0 Off |                  N/A |
| 22%   53C    P0    71W / 250W |      0MiB / 12212MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  Off  | 00000000:09:00.0 Off |                  N/A |
| 22%   53C    P0    75W / 250W |      0MiB / 12212MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX TIT...  Off  | 00000000:84:00.0 Off |                  N/A |
| 22%   51C    P0    74W / 250W |      0MiB / 12212MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX TIT...  Off  | 00000000:85:00.0 Off |                  N/A |
| 22%   49C    P0    73W / 250W |      0MiB / 12212MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX TIT...  Off  | 00000000:88:00.0 Off |                  N/A |
| 22%   47C    P0    72W / 250W |      0MiB / 12212MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX TIT...  Off  | 00000000:89:00.0 Off |                  N/A |
| 22%   46C    P0    67W / 250W |      0MiB / 12212MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
fmassa commented 5 years ago

@miguelvr those lines were added in https://github.com/facebookresearch/maskrcnn-benchmark/pull/165. Maybe @keineahnung2345 can add a bit more detail on them.

keineahnung2345 commented 5 years ago

@miguelvr This line is used to avoid the warning given by debconf. Ref: https://github.com/phusion/baseimage-docker/issues/58.

zimenglan-sysu-512 commented 5 years ago

Hi @archdyn @fmassa @miguelvr, when I use nvidia-docker to build the image, torch.cuda.is_available() returns False. Then, after building, when I use nvidia-docker to run the image, torch.cuda.is_available() returns True. It is so weird.

With the help of a friend I debugged it, and we finally found that changing the line in setup.py as below:

# if torch.cuda.is_available() and CUDA_HOME is not None:
if CUDA_HOME is not None:

solves the problem "RuntimeError: Not compiled with GPU support (nms at /algo_code/maskrcnn_benchmark/csrc/nms.h:22)" when running the image.

Although I have no idea why torch.cuda.is_available() returns False at build time.

Thanks.

fmassa commented 5 years ago

I have no idea either... but if I remove the if torch.cuda.is_available() check, I'm afraid that compilation could fail if the user has, for some reason, installed some parts of CUDA on their system (but not nvcc).

obendidi commented 5 years ago

Hi, any updates on a potential fix? For now my workaround is to build the project each time at run time:

nvidia-docker run --ipc=host \
    -v /home/archdyn/Datasets/coco:/maskrcnn-benchmark/datasets/coco \
    --rm -it maskrcnn-benchmark:latest \
    bash -c "python setup.py build develop && \
             python tools/train_net.py --config-file config.yaml"

miguelvr commented 5 years ago

@bendidi a better workaround would be to build the image, run a container with bash, and install the packages there. Then leave the container without stopping it, and create an image from that container with docker commit.

obendidi commented 5 years ago

I agree with you @miguelvr, but what I'm trying to build is a Dockerfile that I can deploy automatically on multiple machines and that starts training/inference automatically, without having to do the docker commit manipulation manually each time. Another possible workaround would be to:

Add an option to python setup.py build develop that forces the CUDA build even if torch.cuda.is_available() returns False, or maybe an env var, like FORCE_CUDA=1 python setup.py build develop
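
In build terms that would mean something like this (hypothetical: it only works if setup.py were patched to read such a variable):

# hypothetical: requires setup.py to honour a FORCE_CUDA environment variable
git clone https://github.com/facebookresearch/maskrcnn-benchmark.git
cd maskrcnn-benchmark
FORCE_CUDA=1 python setup.py build develop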

fmassa commented 5 years ago

Will this actually solve the problem for you? Maybe you could try applying a patch to the repo after cloning it in your Dockerfile as a temporary solution?

miguelvr commented 5 years ago

The weird part is that it works for CUDA 9.0 but not for CUDA 9.1 or 9.2

obendidi commented 5 years ago

@miguelvr it doesn't work for me with CUDA 9.0; tested with:

FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04

RUN echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections

# install basics
RUN apt-get update -y \
 && apt-get install -y apt-utils build-essential \
                      git curl ca-certificates \
                      bzip2 cmake tree htop \
                      bmon iotop g++ libglib2.0-0

# Install Miniconda
RUN curl -so /miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh \
 && chmod +x /miniconda.sh \
 && /miniconda.sh -b -p /miniconda \
 && rm /miniconda.sh

ENV PATH=/miniconda/bin:$PATH

# Create a Python 3.6 environment
RUN /miniconda/bin/conda install -y conda-build \
 && /miniconda/bin/conda create -y --name py36 python=3.6.7 pip \
 && /miniconda/bin/conda clean -ya

ENV CONDA_DEFAULT_ENV=py36
ENV CONDA_PREFIX=/miniconda/envs/$CONDA_DEFAULT_ENV
ENV PATH=$CONDA_PREFIX/bin:$PATH
ENV CONDA_AUTO_UPDATE_CONDA=false

# Install PyTorch 1.0 Nightly
RUN pip install ninja yacs matplotlib opencv-python==3.2.0.6
RUN pip install pyyaml Cython==0.29.1 numpy==1.15
RUN pip install pycocotools==2.0.0

RUN conda install -y pytorch-nightly=1.0.0.dev20181127 -c pytorch && conda clean -ya
RUN python -c "import torch; print(torch.cuda.is_available())"

and the result is:

Step 24/24 : RUN python -c "import torch; print(torch.cuda.is_available())"
 ---> Running in b8a5b5f49b01
False

@fmassa yes, I'll try that, thanks!

miguelvr commented 5 years ago

@bendidi the CUDA version on your machine must match the CUDA version in the docker image. If you don't have CUDA 9.0 on the machine hosting the container, it won't work.

obendidi commented 5 years ago

A temporary solution, as @fmassa suggested:

# install PyTorch Detection
RUN git clone https://github.com/facebookresearch/maskrcnn-benchmark.git \
 && cd maskrcnn-benchmark \
 && sed -i -e 's/torch.cuda.is_available()/True/g' setup.py \
 && python setup.py build develop \
 && sed -i -e 's/True/torch.cuda.is_available()/g' setup.py 
miguelvr commented 5 years ago

@bendidi @zimenglan-sysu-512 could you try replacing the pytorch installation line in the Dockerfile

https://github.com/facebookresearch/maskrcnn-benchmark/blob/ced10f205d9f1b7b8b1d7968e1c9910e020a0165/docker/Dockerfile#L35-L36

by this:

RUN conda install -y pytorch-nightly cuda92 -c pytorch \
  && conda clean -ya

Use the CUDA version corresponding to your host machine (CUDA 9.2 in this example).

zimenglan-sysu-512 commented 5 years ago

Hi @miguelvr, I did use cuda92. It still hits this problem.

HardSoft2023 commented 5 years ago

(quoting @archdyn's earlier comment above, with the modified Dockerfile and the docker commit steps)

This worked for me. I have tested it.

sshuair commented 5 years ago

@GuoLiuFang works for me. thanks.

IssamLaradji commented 5 years ago

Any updates on this for PyTorch 1.0?

fmassa commented 5 years ago

@IssamLaradji do you still see an issue here?

IssamLaradji commented 5 years ago

Thanks for your reply. Yes, I get:

ipdb> nms(boxes, scores, nms_threshold)                                                                           
*** RuntimeError: Not compiled with GPU support (nms at /maskrcnn-benchmark/maskrcnn_benchmark/csrc/nms.h:22)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f6c6d9a3cc5 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: nms(at::Tensor const&, at::Tensor const&, float) + 0xd4 (0x7f6c5bc69404 in /maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-37m-x86_64-linux-gnu.so)
frame #2: <unknown function> + 0x15717 (0x7f6c5bc75717 in /maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-37m-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x1580e (0x7f6c5bc7580e in /maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-37m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x126f5 (0x7f6c5bc726f5 in /maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-37m-x86_64-linux-gnu.so)
<omitting python frames>

After installing it on Docker as:

RUN conda install pytorch torchvision cuda92 -c pytorch

RUN git clone https://github.com/facebookresearch/maskrcnn-benchmark.git \
 && cd maskrcnn-benchmark \
 && python setup.py build develop && cd 

Thanks in advance!

fmassa commented 5 years ago

@IssamLaradji did you try using https://github.com/facebookresearch/maskrcnn-benchmark/issues/167#issuecomment-448812589 to see if it works for you?

IssamLaradji commented 5 years ago

Unfortunately, I can't use that fix in my case :( because I have to launch multiple experiments based on the Docker image on multiple machines. That is, I can't manually access a docker instance, as the experiments have to run automatically. :(

fmassa commented 5 years ago

I don't know why no GPUs can be seen during image creation; this might be an issue with nvidia-docker, but I don't know more about it because I don't often use docker :-/

obendidi commented 5 years ago

@IssamLaradji I had the same problem, and my solution was to modify the Dockerfile a little bit:

# install PyTorch Detection
RUN git clone https://github.com/facebookresearch/maskrcnn-benchmark.git \
 && cd maskrcnn-benchmark \
 && sed -i -e 's/torch.cuda.is_available()/True/g' setup.py \
 && python setup.py build develop \
 && sed -i -e 's/True/torch.cuda.is_available()/g' setup.py 

It's a quick and dirty hack, but it gets the job done :)

IssamLaradji commented 5 years ago

Thanks, but I get this with that command:

/usr/local/cuda/bin/nvcc -DWITH_CUDA -I/maskrcnn-benchmark/maskrcnn_benchmark/csrc -I/opt/conda/lib/python3.7/site-packages/torch/lib/include -I/opt/conda/lib/python3.7/site-packages/torch/lib/include/torch/csrc/api/include -I/opt/conda/lib/python3.7/site-packages/torch/lib/include/TH -I/opt/conda/lib/python3.7/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/opt/conda/include/python3.7m -c /maskrcnn-benchmark/maskrcnn_benchmark/csrc/cuda/nms.cu -o build/temp.linux-x86_64-3.7/maskrcnn-benchmark/maskrcnn_benchmark/csrc/cuda/nms.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --compiler-options '-fPIC' -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
error: command '/usr/local/cuda/bin/nvcc' failed with exit status 1
fmassa commented 5 years ago

Isn't there any more error message there? If not, it would be helpful if you could jump into the docker container and try compiling it from there, so that we can know the error message.

IssamLaradji commented 5 years ago

Sorry, this is the full error log; it found CUDA but it still didn't work :(

Step 52/53 : RUN git clone https://github.com/facebookresearch/maskrcnn-benchmark.git  && cd maskrcnn-benchmark  && sed -i -e 's/torch.cuda.is_available()/True/g' setup.py  && CUDAHOSTCXX=/usr/bin/gcc-5 python setup.py build develop  && sed -i -e 's/True/torch.cuda.is_available()/g' setup.py
 ---> Running in e2e733b95db4
Cloning into 'maskrcnn-benchmark'...
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.7
creating build/lib.linux-x86_64-3.7/maskrcnn_benchmark
copying maskrcnn_benchmark/__init__.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark
creating build/lib.linux-x86_64-3.7/maskrcnn_benchmark/config
copying maskrcnn_benchmark/config/__init__.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/config
copying maskrcnn_benchmark/config/defaults.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/config
copying maskrcnn_benchmark/config/paths_catalog.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/config
creating build/lib.linux-x86_64-3.7/maskrcnn_benchmark/solver
copying maskrcnn_benchmark/solver/build.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/solver
copying maskrcnn_benchmark/solver/__init__.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/solver
copying maskrcnn_benchmark/solver/lr_scheduler.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/solver
creating build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling
copying maskrcnn_benchmark/modeling/registry.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling
copying maskrcnn_benchmark/modeling/utils.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling
copying maskrcnn_benchmark/modeling/__init__.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling
copying maskrcnn_benchmark/modeling/matcher.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling
copying maskrcnn_benchmark/modeling/balanced_positive_negative_sampler.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling
copying maskrcnn_benchmark/modeling/poolers.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling
copying maskrcnn_benchmark/modeling/box_coder.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling
creating build/lib.linux-x86_64-3.7/maskrcnn_benchmark/structures
copying maskrcnn_benchmark/structures/__init__.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/structures
copying maskrcnn_benchmark/structures/bounding_box.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/structures
copying maskrcnn_benchmark/structures/image_list.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/structures
copying maskrcnn_benchmark/structures/segmentation_mask.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/structures
copying maskrcnn_benchmark/structures/boxlist_ops.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/structures
creating build/lib.linux-x86_64-3.7/maskrcnn_benchmark/utils
copying maskrcnn_benchmark/utils/logger.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/utils
copying maskrcnn_benchmark/utils/checkpoint.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/utils
copying maskrcnn_benchmark/utils/registry.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/utils
copying maskrcnn_benchmark/utils/c2_model_loading.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/utils
copying maskrcnn_benchmark/utils/__init__.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/utils
copying maskrcnn_benchmark/utils/env.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/utils
copying maskrcnn_benchmark/utils/model_zoo.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/utils
copying maskrcnn_benchmark/utils/comm.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/utils
copying maskrcnn_benchmark/utils/collect_env.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/utils
copying maskrcnn_benchmark/utils/metric_logger.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/utils
copying maskrcnn_benchmark/utils/model_serialization.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/utils
copying maskrcnn_benchmark/utils/imports.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/utils
copying maskrcnn_benchmark/utils/miscellaneous.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/utils
creating build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data
copying maskrcnn_benchmark/data/build.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data
copying maskrcnn_benchmark/data/__init__.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data
copying maskrcnn_benchmark/data/collate_batch.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data
creating build/lib.linux-x86_64-3.7/maskrcnn_benchmark/layers
copying maskrcnn_benchmark/layers/roi_pool.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/layers
copying maskrcnn_benchmark/layers/smooth_l1_loss.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/layers
copying maskrcnn_benchmark/layers/__init__.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/layers
copying maskrcnn_benchmark/layers/misc.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/layers
copying maskrcnn_benchmark/layers/nms.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/layers
copying maskrcnn_benchmark/layers/roi_align.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/layers
copying maskrcnn_benchmark/layers/batch_norm.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/layers
copying maskrcnn_benchmark/layers/_utils.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/layers
creating build/lib.linux-x86_64-3.7/maskrcnn_benchmark/engine
copying maskrcnn_benchmark/engine/__init__.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/engine
copying maskrcnn_benchmark/engine/inference.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/engine
copying maskrcnn_benchmark/engine/trainer.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/engine
creating build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/backbone
copying maskrcnn_benchmark/modeling/backbone/__init__.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/backbone
copying maskrcnn_benchmark/modeling/backbone/backbone.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/backbone
copying maskrcnn_benchmark/modeling/backbone/resnet.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/backbone
copying maskrcnn_benchmark/modeling/backbone/fpn.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/backbone
creating build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/rpn
copying maskrcnn_benchmark/modeling/rpn/__init__.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/rpn
copying maskrcnn_benchmark/modeling/rpn/anchor_generator.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/rpn
copying maskrcnn_benchmark/modeling/rpn/loss.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/rpn
copying maskrcnn_benchmark/modeling/rpn/inference.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/rpn
copying maskrcnn_benchmark/modeling/rpn/rpn.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/rpn
creating build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/roi_heads
copying maskrcnn_benchmark/modeling/roi_heads/__init__.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/roi_heads
copying maskrcnn_benchmark/modeling/roi_heads/roi_heads.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/roi_heads
creating build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/detector
copying maskrcnn_benchmark/modeling/detector/__init__.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/detector
copying maskrcnn_benchmark/modeling/detector/generalized_rcnn.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/detector
copying maskrcnn_benchmark/modeling/detector/detectors.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/detector
creating build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/roi_heads/box_head
copying maskrcnn_benchmark/modeling/roi_heads/box_head/__init__.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/roi_heads/box_head
copying maskrcnn_benchmark/modeling/roi_heads/box_head/roi_box_predictors.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/roi_heads/box_head
copying maskrcnn_benchmark/modeling/roi_heads/box_head/loss.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/roi_heads/box_head
copying maskrcnn_benchmark/modeling/roi_heads/box_head/inference.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/roi_heads/box_head
copying maskrcnn_benchmark/modeling/roi_heads/box_head/roi_box_feature_extractors.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/roi_heads/box_head
copying maskrcnn_benchmark/modeling/roi_heads/box_head/box_head.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/roi_heads/box_head
creating build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/roi_heads/mask_head
copying maskrcnn_benchmark/modeling/roi_heads/mask_head/roi_mask_predictors.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/roi_heads/mask_head
copying maskrcnn_benchmark/modeling/roi_heads/mask_head/roi_mask_feature_extractors.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/roi_heads/mask_head
copying maskrcnn_benchmark/modeling/roi_heads/mask_head/__init__.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/roi_heads/mask_head
copying maskrcnn_benchmark/modeling/roi_heads/mask_head/loss.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/roi_heads/mask_head
copying maskrcnn_benchmark/modeling/roi_heads/mask_head/inference.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/roi_heads/mask_head
copying maskrcnn_benchmark/modeling/roi_heads/mask_head/mask_head.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/modeling/roi_heads/mask_head
creating build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data/samplers
copying maskrcnn_benchmark/data/samplers/distributed.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data/samplers
copying maskrcnn_benchmark/data/samplers/iteration_based_batch_sampler.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data/samplers
copying maskrcnn_benchmark/data/samplers/__init__.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data/samplers
copying maskrcnn_benchmark/data/samplers/grouped_batch_sampler.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data/samplers
creating build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data/transforms
copying maskrcnn_benchmark/data/transforms/build.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data/transforms
copying maskrcnn_benchmark/data/transforms/__init__.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data/transforms
copying maskrcnn_benchmark/data/transforms/transforms.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data/transforms
creating build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data/datasets
copying maskrcnn_benchmark/data/datasets/concat_dataset.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data/datasets
copying maskrcnn_benchmark/data/datasets/list_dataset.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data/datasets
copying maskrcnn_benchmark/data/datasets/__init__.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data/datasets
copying maskrcnn_benchmark/data/datasets/voc.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data/datasets
copying maskrcnn_benchmark/data/datasets/coco.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data/datasets
creating build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data/datasets/evaluation
copying maskrcnn_benchmark/data/datasets/evaluation/__init__.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data/datasets/evaluation
creating build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data/datasets/evaluation/coco
copying maskrcnn_benchmark/data/datasets/evaluation/coco/coco_eval.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data/datasets/evaluation/coco
copying maskrcnn_benchmark/data/datasets/evaluation/coco/__init__.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data/datasets/evaluation/coco
creating build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data/datasets/evaluation/voc
copying maskrcnn_benchmark/data/datasets/evaluation/voc/voc_eval.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data/datasets/evaluation/voc
copying maskrcnn_benchmark/data/datasets/evaluation/voc/__init__.py -> build/lib.linux-x86_64-3.7/maskrcnn_benchmark/data/datasets/evaluation/voc
running build_ext
building 'maskrcnn_benchmark._C' extension
creating build/temp.linux-x86_64-3.7
creating build/temp.linux-x86_64-3.7/maskrcnn-benchmark
creating build/temp.linux-x86_64-3.7/maskrcnn-benchmark/maskrcnn_benchmark
creating build/temp.linux-x86_64-3.7/maskrcnn-benchmark/maskrcnn_benchmark/csrc
creating build/temp.linux-x86_64-3.7/maskrcnn-benchmark/maskrcnn_benchmark/csrc/cpu
creating build/temp.linux-x86_64-3.7/maskrcnn-benchmark/maskrcnn_benchmark/csrc/cuda
gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DWITH_CUDA -I/maskrcnn-benchmark/maskrcnn_benchmark/csrc -I/opt/conda/lib/python3.7/site-packages/torch/lib/include -I/opt/conda/lib/python3.7/site-packages/torch/lib/include/torch/csrc/api/include -I/opt/conda/lib/python3.7/site-packages/torch/lib/include/TH -I/opt/conda/lib/python3.7/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/opt/conda/include/python3.7m -c /maskrcnn-benchmark/maskrcnn_benchmark/csrc/vision.cpp -o build/temp.linux-x86_64-3.7/maskrcnn-benchmark/maskrcnn_benchmark/csrc/vision.o -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DWITH_CUDA -I/maskrcnn-benchmark/maskrcnn_benchmark/csrc -I/opt/conda/lib/python3.7/site-packages/torch/lib/include -I/opt/conda/lib/python3.7/site-packages/torch/lib/include/torch/csrc/api/include -I/opt/conda/lib/python3.7/site-packages/torch/lib/include/TH -I/opt/conda/lib/python3.7/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/opt/conda/include/python3.7m -c /maskrcnn-benchmark/maskrcnn_benchmark/csrc/cpu/nms_cpu.cpp -o build/temp.linux-x86_64-3.7/maskrcnn-benchmark/maskrcnn_benchmark/csrc/cpu/nms_cpu.o -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DWITH_CUDA -I/maskrcnn-benchmark/maskrcnn_benchmark/csrc -I/opt/conda/lib/python3.7/site-packages/torch/lib/include -I/opt/conda/lib/python3.7/site-packages/torch/lib/include/torch/csrc/api/include -I/opt/conda/lib/python3.7/site-packages/torch/lib/include/TH -I/opt/conda/lib/python3.7/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/opt/conda/include/python3.7m -c /maskrcnn-benchmark/maskrcnn_benchmark/csrc/cpu/ROIAlign_cpu.cpp -o build/temp.linux-x86_64-3.7/maskrcnn-benchmark/maskrcnn_benchmark/csrc/cpu/ROIAlign_cpu.o -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
/usr/local/cuda/bin/nvcc -DWITH_CUDA -I/maskrcnn-benchmark/maskrcnn_benchmark/csrc -I/opt/conda/lib/python3.7/site-packages/torch/lib/include -I/opt/conda/lib/python3.7/site-packages/torch/lib/include/torch/csrc/api/include -I/opt/conda/lib/python3.7/site-packages/torch/lib/include/TH -I/opt/conda/lib/python3.7/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/opt/conda/include/python3.7m -c /maskrcnn-benchmark/maskrcnn_benchmark/csrc/cuda/nms.cu -o build/temp.linux-x86_64-3.7/maskrcnn-benchmark/maskrcnn_benchmark/csrc/cuda/nms.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --compiler-options '-fPIC' -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
error: command '/usr/local/cuda/bin/nvcc' failed with exit status 1
The push refers to repository [images.borgy.elementai.lan/issam.laradji/v1]

fmassa commented 5 years ago

@IssamLaradji this line is suspicious

No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'

Apart from that, I don't have any other idea what it might be.

lham commented 5 years ago

I get the Not compiled with GPU support error as well, but I can make it work using the sed workaround from @bendidi.

My own conclusion is that we are seeing the same problem as discussed in https://github.com/NVIDIA/nvidia-docker/issues/225 (which seems to be working as intended? But then I don't understand how it could work in some cases here at all). Running the following Dockerfile prints False for the last statement, which is why maskrcnn-benchmark gets compiled without GPU support later. I haven't tried installing with conda. I tried compiling PyTorch from scratch, but that didn't help either.

FROM nvidia/cuda:9.2-cudnn7-devel-ubuntu18.04

RUN apt-get update && apt-get install -y --no-install-recommends \
 vim \
 git \
 python3.6-dev \
 python3-pip \
 && rm -rf /var/lib/apt/lists/*

RUN rm -f /usr/bin/python && ln -s /usr/bin/python3.6 /usr/bin/python
RUN rm -f /usr/bin/pip && ln -s /usr/bin/pip3 /usr/bin/pip

RUN pip install -U pip
RUN pip install numpy
RUN pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cu90/torch_nightly.html

RUN python -c "import torch; print(torch.cuda.is_available());"
fmassa commented 5 years ago

Cool, thanks for the info and the reference @lham !

miguelvr commented 5 years ago

@fmassa if this is just a matter of torch.cuda.is_available() not working, can we simply add another flag to force a CUDA installation and bypass it?

It should be as simple as changing a single line in setup.py:

    if (torch.cuda.is_available() and CUDA_HOME is not None) or FORCE_CUDA:
        extension = CUDAExtension
        sources += source_cuda
        define_macros += [("WITH_CUDA", None)]
        extra_compile_args["nvcc"] = [
            "-DCUDA_HAS_FP16=1",
            "-D__CUDA_NO_HALF_OPERATORS__",
            "-D__CUDA_NO_HALF_CONVERSIONS__",
            "-D__CUDA_NO_HALF2_OPERATORS__",
        ]