Can't setup the environment

nikolaydyankov commented 1 year ago

I've been trying to build a docker image by following the steps from INSTALL.md, but I'm stuck on this:

# Setup MSDeformAttn
cd oneformer/modeling/pixel_decoder/ops
sh make.sh

I tried installing the CUDA toolkit globally, and I also tried without using conda at all. No luck; I keep getting all kinds of errors. Please help, I've been pulling my hair out over this all day. Here is my Dockerfile so far:

# Use the official Ubuntu 20.04 LTS image as the base image
FROM ubuntu:20.04

# Set environment variables to avoid interaction during package installation
ENV DEBIAN_FRONTEND=noninteractive

# Update the package index and install required packages
RUN apt-get update && apt-get install -y --no-install-recommends \
    wget \
    ca-certificates \
    bzip2 \
    build-essential \
    git

# Set the working directory
WORKDIR /opt

# Download and install Miniconda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh \
    && chmod +x Miniconda3-latest-Linux-x86_64.sh \
    && ./Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda \
    && rm Miniconda3-latest-Linux-x86_64.sh

# Add conda to the system PATH
ENV PATH="/opt/conda/bin:${PATH}"

# Create the "oneformer" virtual environment
RUN conda create -y -n oneformer

# Activate the "oneformer" virtual environment and run any further commands within it
SHELL ["conda", "run", "-n", "oneformer", "/bin/bash", "-c"]

RUN git clone https://github.com/SHI-Labs/OneFormer.git /OneFormer
RUN cd /OneFormer
WORKDIR /OneFormer

# Install Pytorch
RUN conda install -y pytorch==1.10.1 -c pytorch
RUN conda install -y torchvision==0.11.2 -c pytorch
RUN conda install -y cudatoolkit=11.3 -c pytorch

# Install opencv (required for running the demo)
RUN pip3 install -U opencv-python

# Install detectron2
RUN python -m pip install detectron2 -f \
    https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/index.html

# Install other dependencies
RUN pip3 install git+https://github.com/cocodataset/panopticapi.git
RUN pip3 install git+https://github.com/mcordts/cityscapesScripts.git
RUN pip3 install -r requirements.txt

# Set up wandb
RUN pip3 install wandb
#ENV WANDB_API_KEY=...
#RUN wandb login

# Setup MSDeformAttn
# THIS IS WHERE IT BREAKS
# ENV CUDA_HOME=/opt/conda/envs/oneformer/lib/python3.9/site-packages/torch/cuda
# ENV FORCE_CUDA=1
RUN cd oneformer/modeling/pixel_decoder/ops && \
    sh ./make.sh

# Set the entrypoint to use the "oneformer" virtual environment by default
ENTRYPOINT ["conda", "run", "--no-capture-output", "-n", "oneformer"]

# Set the default command to run when starting the container
CMD ["/bin/bash"]

And this is the error that I'm getting:

[19/19] RUN cd oneformer/modeling/pixel_decoder/ops &&     sh ./make.sh:
#0 1.605 /opt/conda/envs/oneformer/lib/python3.9/site-packages/torch/utils/cpp_extension.py:381: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
#0 1.605   warnings.warn(msg.format('we could not find ninja.'))
#0 1.605 error: [Errno 2] No such file or directory: '/opt/conda/envs/oneformer/lib/python3.9/site-packages/torch/cuda/bin/nvcc'
#0 1.605
#0 1.605 ERROR conda.cli.main_run:execute(47): `conda run /bin/bash -c cd oneformer/modeling/pixel_decoder/ops &&     sh ./make.sh` failed. (See above for error)
#0 1.605 No CUDA runtime is found, using CUDA_HOME='/opt/conda/envs/oneformer/lib/python3.9/site-packages/torch/cuda'
#0 1.605 running build
#0 1.605 running build_py
#0 1.605 creating build
#0 1.605 creating build/lib.linux-x86_64-3.9
#0 1.605 creating build/lib.linux-x86_64-3.9/functions
#0 1.605 copying functions/__init__.py -> build/lib.linux-x86_64-3.9/functions
#0 1.605 copying functions/ms_deform_attn_func.py -> build/lib.linux-x86_64-3.9/functions
#0 1.605 creating build/lib.linux-x86_64-3.9/modules
#0 1.605 copying modules/__init__.py -> build/lib.linux-x86_64-3.9/modules
#0 1.605 copying modules/ms_deform_attn.py -> build/lib.linux-x86_64-3.9/modules
#0 1.605 running build_ext
#0 1.605
------
failed to solve: executor failed running [conda run -n oneformer /bin/bash -c cd oneformer/modeling/pixel_decoder/ops &&     sh ./make.sh]: exit code: 1
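
The log above shows the build looking for nvcc under CUDA_HOME/bin and not finding it. The base image here is plain ubuntu:20.04, and as far as I know the conda cudatoolkit package provides runtime libraries rather than the nvcc compiler, so this outcome is expected. A minimal pre-flight check, assuming you run it inside the image before make.sh, could look like the sketch below. The function name find_nvcc is made up for illustration and is not part of OneFormer or PyTorch.

```python
import os
import shutil


def find_nvcc(cuda_home=None):
    """Return a path to nvcc, or None if no CUDA compiler can be found.

    Checks <cuda_home>/bin/nvcc first (the location the failing build
    looked in), then falls back to searching PATH.
    """
    if cuda_home:
        candidate = os.path.join(cuda_home, "bin", "nvcc")
        # The file must exist and be executable to be usable by the build
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    # shutil.which returns None when nvcc is not on PATH
    return shutil.which("nvcc")
```

Calling find_nvcc(os.environ.get("CUDA_HOME")) and stopping the build when it returns None gives a clearer message than the Errno 2 above. A "devel" CUDA base image, as used later in this thread, is one way to get nvcc into the image.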

nikolaydyankov commented 1 year ago

On a side note, adding a Dockerfile in the /demo folder would be amazing.

nikolaydyankov commented 1 year ago

Another Dockerfile, with a different error:

#0 16.18 /usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py:381: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
#0 16.18   warnings.warn(msg.format('we could not find ninja.'))
#0 16.18 Traceback (most recent call last):
#0 16.18   File "setup.py", line 69, in <module>
#0 16.18     setup(
#0 16.18   File "/usr/local/lib/python3.8/dist-packages/setuptools/__init__.py", line 153, in setup
#0 16.18     return distutils.core.setup(**attrs)
#0 16.18   File "/usr/lib/python3.8/distutils/core.py", line 148, in setup
#0 16.18     dist.run_commands()
#0 16.18   File "/usr/lib/python3.8/distutils/dist.py", line 966, in run_commands
#0 16.18     self.run_command(cmd)
#0 16.18   File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
#0 16.18     cmd_obj.run()
#0 16.18   File "/usr/lib/python3.8/distutils/command/build.py", line 135, in run
#0 16.18     self.run_command(cmd_name)
#0 16.18   File "/usr/lib/python3.8/distutils/cmd.py", line 313, in run_command
#0 16.18     self.distribution.run_command(command)
#0 16.18   File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
#0 16.18     cmd_obj.run()
#0 16.18   File "/usr/local/lib/python3.8/dist-packages/setuptools/command/build_ext.py", line 79, in run
#0 16.18     _build_ext.run(self)
#0 16.18   File "/usr/local/lib/python3.8/dist-packages/Cython/Distutils/old_build_ext.py", line 186, in run
#0 16.18     _build_ext.build_ext.run(self)
#0 16.18   File "/usr/lib/python3.8/distutils/command/build_ext.py", line 340, in run
#0 16.18     self.build_extensions()
#0 16.18   File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 735, in build_extensions
#0 16.18     build_ext.build_extensions(self)
#0 16.18   File "/usr/local/lib/python3.8/dist-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
#0 16.18     _build_ext.build_ext.build_extensions(self)
#0 16.18   File "/usr/lib/python3.8/distutils/command/build_ext.py", line 449, in build_extensions
#0 16.18     self._build_extensions_serial()
#0 16.18   File "/usr/lib/python3.8/distutils/command/build_ext.py", line 474, in _build_extensions_serial
#0 16.18     self.build_extension(ext)
#0 16.18   File "/usr/local/lib/python3.8/dist-packages/setuptools/command/build_ext.py", line 202, in build_extension
#0 16.18     _build_ext.build_extension(self, ext)
#0 16.18   File "/usr/lib/python3.8/distutils/command/build_ext.py", line 528, in build_extension
#0 16.18     objects = self.compiler.compile(sources,
#0 16.18   File "/usr/lib/python3.8/distutils/ccompiler.py", line 574, in compile
#0 16.18     self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts)
#0 16.18   File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 483, in unix_wrap_single_compile
#0 16.18     cflags = unix_cuda_flags(cflags)
#0 16.18   File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 450, in unix_cuda_flags
#0 16.18     cflags + _get_cuda_arch_flags(cflags))
#0 16.18   File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1606, in _get_cuda_arch_flags
#0 16.18     arch_list[-1] += '+PTX'
#0 16.18 IndexError: list index out of range
------
failed to solve: executor failed running [/bin/sh -c cd oneformer/modeling/pixel_decoder/ops &&     sh ./make.sh]: exit code: 1

And here is the Dockerfile:

FROM nvidia/cuda:11.3.1-devel-ubuntu20.04

RUN apt-get update && apt-get upgrade -y
RUN apt-get install -y --no-install-recommends \
    python3 python3-pip python3-dev build-essential \
    libgomp1 \
    git
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3 1
RUN python -m pip install --upgrade pip wheel

# Install PyTorch 1.10.1 and torchvision 0.11.2 with CUDA 11.3 support
RUN python -m pip install torch==1.10.1+cu113 torchvision==0.11.2+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

# Clone the OneFormer repository
RUN git clone https://github.com/SHI-Labs/OneFormer.git /OneFormer
RUN cd /OneFormer
WORKDIR /OneFormer

# Install detectron2 and other dependencies
RUN python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
RUN pip install git+https://github.com/cocodataset/panopticapi.git
RUN pip install git+https://github.com/mcordts/cityscapesScripts.git
RUN pip install -r requirements.txt

# Set up wandb
RUN pip install wandb
#ENV WANDB_API_KEY=...
#RUN wandb login

# Setup MSDeformAttn
ENV CUDA_HOME=/usr/local/cuda-11.3
ENV FORCE_CUDA=1
RUN cd oneformer/modeling/pixel_decoder/ops && \
    sh ./make.sh

# Set the default command to run when starting the container
CMD ["/bin/bash"]
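
The traceback for this Dockerfile ends at arch_list[-1] += '+PTX' with an IndexError. That is consistent with TORCH_CUDA_ARCH_LIST being unset while no GPU is visible during docker build, which leaves the list of detected architectures empty. The sketch below is a simplified model of that selection logic, written only to illustrate the failure. It is not the PyTorch source, and the function name is hypothetical.

```python
def cuda_arch_list(env_value, visible_capabilities):
    """Simplified model of how the target architectures are chosen.

    env_value: the TORCH_CUDA_ARCH_LIST string, or None if it is unset.
    visible_capabilities: (major, minor) tuples of the GPUs the build can see.
    """
    if env_value:
        # An explicit list wins; accept both ';' and space as separators
        return env_value.replace(" ", ";").split(";")
    # Otherwise fall back to whatever GPUs are visible at build time
    arch_list = [f"{major}.{minor}" for major, minor in sorted(set(visible_capabilities))]
    # With no visible GPU the list is empty, so indexing [-1] raises IndexError
    arch_list[-1] += "+PTX"
    return arch_list
```

In this model, cuda_arch_list(None, []) raises IndexError just like the log, while passing any explicit value avoids the GPU lookup entirely. That suggests setting TORCH_CUDA_ARCH_LIST in the Dockerfile before running make.sh.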

nikolaydyankov commented 1 year ago

This is as far as I got:

FROM nvidia/cuda:11.3.1-devel-ubuntu20.04

# Set environment variables
ENV DEBIAN_FRONTEND=noninteractive
ENV LANG=C.UTF-8
ENV LC_ALL=C.UTF-8

# Update package list and install dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    wget \
    ca-certificates \
    git \
    build-essential \
    libglib2.0-0 \
    libsm6 \
    libxext6 \
    libxrender1 \
    libyaml-cpp-dev \
    libopencv-dev \
    && rm -rf /var/lib/apt/lists/*

# Install GCC, G++ 9
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc-9 \
    g++-9 \
    && rm -rf /var/lib/apt/lists/* \
    && update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 100 \
    && update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-9 100

# Install conda 4.12.0
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-py38_4.12.0-Linux-x86_64.sh -O miniconda.sh \
    && chmod +x miniconda.sh \
    && ./miniconda.sh -b -p /opt/conda \
    && rm miniconda.sh \
    && /opt/conda/bin/conda clean -tipsy \
    && ln -s /opt/conda/etc/profile.d/conda.sh /etc/profile.d/conda.sh \
    && echo ". /opt/conda/etc/profile.d/conda.sh" >> ~/.bashrc \
    && echo "conda activate base" >> ~/.bashrc

# Set some environment variables
ENV PATH /opt/conda/bin:$PATH
ENV WANDB_API_KEY=...
ENV CUDA_HOME=/usr/local/cuda
ENV FORCE_CUDA=1

# Clone OneFormer repository and set working directory
RUN git clone https://github.com/SHI-Labs/OneFormer.git /OneFormer
WORKDIR /OneFormer

# Install dependencies
RUN conda install pytorch==1.10.1 torchvision==0.11.2 cudatoolkit=11.3 -c pytorch -c conda-forge
RUN pip3 install -U opencv-python
RUN python -m pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/index.html
RUN pip3 install git+https://github.com/cocodataset/panopticapi.git
RUN pip3 install git+https://github.com/mcordts/cityscapesScripts.git
RUN pip3 install -r requirements.txt
#RUN pip3 install wandb
#RUN wandb login
RUN pip3 install colormap
RUN pip3 install easydev

# Setup MSDeformAttn
ENV TORCH_CUDA_ARCH_LIST="6.0;6.1;6.2;7.0;7.5;8.0;8.6+PTX"
RUN cd oneformer/modeling/pixel_decoder/ops && \
    chmod +x make.sh && \
    ./make.sh

# Downgrade numpy
RUN pip3 uninstall numpy -y
RUN pip3 install numpy==1.23.1

# Set the default command to run when starting the container
CMD ["/bin/bash"]

This image works, but the model can't be trained on an RTX 4090 due to a bug in PyTorch:

Traceback (most recent call last):
  File "/OneFormer/workspace/oneformer-scripts/train.py", line 448, in <module>
    trainer.train()
  File "/opt/conda/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 484, in train
    super().train(self.start_iter, self.max_iter)
  File "/opt/conda/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/opt/conda/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/opt/conda/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 395, in run_step
    loss_dict = self.model(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/OneFormer/oneformer/oneformer_model.py", line 296, in forward
    losses = self.criterion(outputs, targets)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/OneFormer/oneformer/modeling/criterion.py", line 306, in forward
    indices = self.matcher(aux_outputs, targets)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/OneFormer/oneformer/modeling/matcher.py", line 202, in forward
    return self.memory_efficient_forward(outputs, targets)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/OneFormer/oneformer/modeling/matcher.py", line 161, in memory_efficient_forward
    cost_mask = batch_sigmoid_ce_loss_jit(out_mask, tgt_mask)
RuntimeError: nvrtc: error: invalid value for --gpu-architecture (-arch)

nvrtc compilation failed:

#define NAN __int_as_float(0x7fffffff)
#define POS_INFINITY __int_as_float(0x7f800000)
#define NEG_INFINITY __int_as_float(0xff800000)

template<typename T>
__device__ T maximum(T a, T b) {
  return isnan(a) ? a : (a > b ? a : b);
}

template<typename T>
__device__ T minimum(T a, T b) {
  return isnan(a) ? a : (a < b ? a : b);
}

extern "C" __global__
void fused_neg_add(float* ttargets_1, float* aten_add) {
{
  float v = __ldg(ttargets_1 + (long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x));
  aten_add[(long long)(threadIdx.x) + 512ll * (long long)(blockIdx.x)] = (0.f - v) + 1.f;
}
}

I can't update the CUDA version, because then MSDeformAttn doesn't build. This is the issue in the PyTorch repo: https://github.com/pytorch/pytorch/issues/87595#issue-1420810328
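
The nvrtc error above complains about the value passed to --gpu-architecture. An RTX 4090 reports compute capability 8.9, which the CUDA 11.3 toolkit predates, so the runtime compiler bundled with this PyTorch build appears unable to target it. The sketch below checks a toolkit version against a capability. The table is a small illustrative subset that I filled in from NVIDIA's release notes, and the names are hypothetical.

```python
# Highest compute capability each CUDA toolkit release can compile for.
# Illustrative subset only; extend it from NVIDIA's release notes as needed.
MAX_CAPABILITY = {
    (11, 3): (8, 6),  # Ampere consumer cards such as the RTX 30 series
    (11, 8): (9, 0),  # first 11.x release that also covers 8.9 (RTX 4090)
}


def toolkit_can_target(cuda_version, capability):
    """Return True if the given CUDA toolkit can compile for the capability.

    cuda_version: (major, minor) of the toolkit, e.g. (11, 3).
    capability: (major, minor) compute capability of the GPU, e.g. (8, 9).
    """
    if cuda_version not in MAX_CAPABILITY:
        raise ValueError(f"CUDA {cuda_version} is not in the table")
    # Tuples compare element by element, so (8, 9) <= (8, 6) is False
    return capability <= MAX_CAPABILITY[cuda_version]
```

On a machine with a GPU, torch.version.cuda and torch.cuda.get_device_capability() supply the two inputs. For the setup in this comment the check comes out False, which matches the failure.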

praeclarumjj3 commented 1 year ago

Hi @nikolaydyankov, thanks for your interest in our work. Did you take a look at the Dockerfile used for hosting our HuggingFace Space demo? If not, it might be worth a look.

praveenVnktsh commented 1 year ago

I'm running into the same architecture-mismatch problem and am unable to run on an RTX 4090. I've temporarily replaced all the JIT functions with regular functions and it runs, but it's very slow.
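
One way to make that workaround switchable, rather than editing each call site, is a small helper that returns either the plain function or its scripted version. This is only a sketch under stated assumptions: select_loss_fn, jit_requested, and the ONEFORMER_USE_JIT variable are all made up for illustration and do not exist in OneFormer.

```python
import os


def jit_requested(environ=os.environ):
    """Return False only when the (hypothetical) ONEFORMER_USE_JIT is "0"."""
    return environ.get("ONEFORMER_USE_JIT", "1") != "0"


def select_loss_fn(eager_fn, use_jit):
    """Return eager_fn unchanged, or its TorchScript-compiled version."""
    if not use_jit:
        # Eager path: no runtime kernel compilation is involved
        return eager_fn
    # Imported lazily so the eager path works even where scripting is broken
    import torch
    return torch.jit.script(eager_fn)
```

A module that currently defines something like batch_sigmoid_ce_loss_jit could then build it with select_loss_fn(plain_function, jit_requested()). PyTorch also documents a global PYTORCH_JIT=0 environment variable that disables script and trace annotations, which may achieve the same effect without code changes.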

linzy5 commented 3 months ago

@nikolaydyankov Hi, I ran into exactly the same problem. Thanks to @praeclarumjj3's pointer, I found the key in the Dockerfile used for OneFormer's HuggingFace Space. The two key commands in that Dockerfile are:

ARG TORCH_CUDA_ARCH_LIST=7.5+PTX
RUN cd /path/to/ops && FORCE_CUDA=1 python setup.py build install

It seems TORCH_CUDA_ARCH_LIST needs to be changed to match your GPU and CUDA version.
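
To work out a value for your own hardware, the sketch below turns compute capabilities into the string format used above. It assumes you already know the capabilities, for example from torch.cuda.get_device_capability() on the machine that will run the container. The function name is hypothetical.

```python
def format_arch_list(capabilities, ptx=True):
    """Build a TORCH_CUDA_ARCH_LIST value from (major, minor) tuples.

    Duplicates are dropped and entries are sorted ascending. When ptx is
    True, "+PTX" is appended to the highest entry so that newer GPUs can
    still JIT-compile the kernels from PTX.
    """
    unique = sorted(set(capabilities))
    if not unique:
        raise ValueError("at least one compute capability is required")
    entries = [f"{major}.{minor}" for major, minor in unique]
    if ptx:
        entries[-1] += "+PTX"
    return ";".join(entries)
```

For a single 7.5 device, format_arch_list([(7, 5)]) yields the same 7.5+PTX value as the ARG line above. Keep in mind that the toolkit in the image must itself support every architecture you list.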

SHI-Labs / OneFormer

Can't setup the environment #55