azavea / raster-vision

An open source library and framework for deep learning on satellite and aerial imagery.
https://docs.rastervision.io

Rastervision and HPC machine integration cuDNN conv function ERROR #1280

Closed marcioluish closed 3 years ago

marcioluish commented 3 years ago

❓ Questions and Help

Hi,

I'm trying to integrate rastervision with an HPC machine and run it in a container. To do that, I have to use a specific NVIDIA PyTorch image (https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_21-02.html#rel_21-02), which is compatible with my hardware.

As you can see in the link above, it uses:

However, when I try to train a model by running 'rastervision run', I get the following error with cuDNN:

File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 390, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

IMPORTANT: When I run conv functions directly via PyTorch, I don't get any error, so I believe the problem lies in the integration between the torch version that I'm running and rastervision.

Full error traceback:

> Training [#######-----------------------------] 20%
> Traceback (most recent call last):
>   File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
>     return _run_code(code, main_globals, None,
>   File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
>     exec(code, run_globals)
>   File "/opt/src/rastervision_pipeline/rastervision/pipeline/cli.py", line 248, in
>     main()
>   File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 722, in __call__
>     return self.main(*args, **kwargs)
>   File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 697, in main
>     rv = self.invoke(ctx)
>   File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
>     return _process_result(sub_ctx.command.invoke(sub_ctx))
>   File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 895, in invoke
>     return ctx.invoke(self.callback, **ctx.params)
>   File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 535, in invoke
>     return callback(*args, **kwargs)
>   File "/opt/src/rastervision_pipeline/rastervision/pipeline/cli.py", line 235, in run_command
>     _run_command(
>   File "/opt/src/rastervision_pipeline/rastervision/pipeline/cli.py", line 217, in _run_command
>     command_fn()
>   File "/opt/src/rastervision_core/rastervision/core/rv_pipeline/rv_pipeline.py", line 134, in train
>     backend.train(source_bundle_uri=self.config.source_bundle_uri)
>   File "/opt/src/rastervision_pytorch_backend/rastervision/pytorch_backend/pytorch_learner_backend.py", line 75, in train
>     learner.main()
>   File "/opt/src/rastervision_pytorch_learner/rastervision/pytorch_learner/learner.py", line 184, in main
>     self.train()
>   File "/opt/src/rastervision_pytorch_learner/rastervision/pytorch_learner/learner.py", line 1197, in train
>     train_metrics = self.train_epoch()
>   File "/opt/src/rastervision_pytorch_learner/rastervision/pytorch_learner/learner.py", line 1126, in train_epoch
>     output = self.train_step(batch, batch_ind)
>   File "/opt/src/rastervision_pytorch_learner/rastervision/pytorch_learner/semantic_segmentation_learner.py", line 109, in train_step
>     out = self.post_forward(self.model(x))
>   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 881, in _call_impl
>     result = self.forward(*input, **kwargs)
>   File "/opt/conda/lib/python3.8/site-packages/torchvision/models/segmentation/_utils.py", line 20, in forward
>     features = self.backbone(x)
>   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 881, in _call_impl
>     result = self.forward(*input, **kwargs)
>   File "/opt/conda/lib/python3.8/site-packages/torchvision/models/_utils.py", line 63, in forward
>     x = module(x)
>   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 881, in _call_impl
>     result = self.forward(*input, **kwargs)
>   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 394, in forward
>     return self._conv_forward(input, self.weight, self.bias)
>   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 390, in _conv_forward
>     return F.conv2d(input, weight, bias, self.stride,
> RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

My DockerFile:

> FROM nvcr.io/nvidia/pytorch:21.02-py3
>
> ENV DEBIAN_FRONTEND=noninteractive
> ARG DEBIAN_FRONTEND=noninteractive
>
> WORKDIR /workspace
>
> RUN apt-get update
>
> RUN apt install libprotobuf-dev protobuf-compiler -y
>
> RUN apt-get update && apt-get install -y software-properties-common \
>     && rm -rf /var/lib/apt/lists/* \
>     && add-apt-repository "deb http://security.ubuntu.com/ubuntu xenial-security main" \
>     && apt-get update && apt-get install -y \
>     build-essential \
>     libsm6 \
>     libxext6 \
>     libfontconfig1 \
>     libxrender1 \
>     libswscale-dev \
>     libtbb2 \
>     libtbb-dev \
>     libjpeg-dev \
>     libpng-dev \
>     libtiff-dev \
>     libjasper-dev \
>     libavformat-dev \
>     libpq-dev \
>     libturbojpeg \
>     git \
>     libgl1-mesa-glx \
>     ffmpeg \
>     && apt-get clean \
>     && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
>
> ENV CURL_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
>
> RUN pip install lvis
>
> RUN apt-get install autoconf automake libtool curl make g++ unzip -y
>
> //Protoc
> RUN curl -OL https://github.com/google/protobuf/releases/download/v3.5.1/protoc-3.5.1-linux-x86_64.zip
> RUN unzip protoc-3.5.1-linux-x86_64.zip -d protoc3
> RUN mv protoc3/bin/* /usr/local/bin/
> RUN mv protoc3/include/* /usr/local/include/
>
> RUN pip install -U catalyst
>
> //Install gdal
> RUN conda install -y -c conda-forge gdal=3.0.4
> //Setup GDAL_DATA directory, rasterio needs it.
> ENV GDAL_DATA=/opt/conda/lib/python3.8/site-packages/rasterio/gdal_data/
>
> WORKDIR /opt/src/
>
> COPY ./requirements-dev.txt /opt/src/requirements-dev.txt
> RUN pip install -r requirements-dev.txt
>
> //Install requirements for each package.
> COPY ./rastervision_pipeline/requirements.txt /opt/src/requirements.txt
> RUN pip install $(grep -ivE "rastervision_*" requirements.txt)
>
> COPY ./rastervision_aws_s3/requirements.txt /opt/src/requirements.txt
> RUN pip install $(grep -ivE "rastervision_*" requirements.txt)
>
> COPY ./rastervision_aws_batch/requirements.txt /opt/src/requirements.txt
> RUN pip install $(grep -ivE "rastervision_*" requirements.txt)
>
> COPY ./rastervision_pytorch_learner/requirements.txt /opt/src/requirements.txt
> RUN pip install $(grep -ivE "rastervision_*" requirements.txt)
>
> COPY ./rastervision_gdal_vsi/requirements.txt /opt/src/requirements.txt
> RUN pip install $(grep -ivE "rastervision_*" requirements.txt)
>
> COPY ./rastervision_core/requirements.txt /opt/src/requirements.txt
> RUN pip install $(grep -ivE "rastervision_*" requirements.txt)
>
> //Install docs/requirements.txt
> COPY ./docs/requirements.txt /opt/src/docs/requirements.txt
> RUN pip install -r docs/requirements.txt
>
> COPY scripts /opt/src/scripts/
> COPY scripts/rastervision /usr/local/bin/rastervision
> COPY tests /opt/src/tests/
> COPY integration_tests /opt/src/integration_tests/
> COPY .flake8 /opt/src/.flake8
> COPY .coveragerc /opt/src/.coveragerc
>
> //Needed for click to work
> ENV LC_ALL C.UTF-8
> ENV LANG C.UTF-8
> ENV PROJ_LIB /opt/conda/share/proj/
>
> //Copy code for each package.
> ENV PYTHONPATH=/opt/src:$PYTHONPATH
> ENV PYTHONPATH=/opt/src/rastervision_pipeline/:$PYTHONPATH
> ENV PYTHONPATH=/opt/src/rastervision_aws_s3/:$PYTHONPATH
> ENV PYTHONPATH=/opt/src/rastervision_aws_batch/:$PYTHONPATH
> ENV PYTHONPATH=/opt/src/rastervision_gdal_vsi/:$PYTHONPATH
> ENV PYTHONPATH=/opt/src/rastervision_core/:$PYTHONPATH
> ENV PYTHONPATH=/opt/src/rastervision_pytorch_learner/:$PYTHONPATH
> ENV PYTHONPATH=/opt/src/rastervision_pytorch_backend/:$PYTHONPATH
>
> COPY ./rastervision_pipeline/ /opt/src/rastervision_pipeline/
> COPY ./rastervision_aws_s3/ /opt/src/rastervision_aws_s3/
> COPY ./rastervision_aws_batch/ /opt/src/rastervision_aws_batch/
> COPY ./rastervision_core/ /opt/src/rastervision_core/
> COPY ./rastervision_pytorch_learner/ /opt/src/rastervision_pytorch_learner/
> COPY ./rastervision_pytorch_backend/ /opt/src/rastervision_pytorch_backend/
> COPY ./rastervision_gdal_vsi/ /opt/src/rastervision_gdal_vsi/
>
> RUN pip install rastervision==0.13.1 --no-dependencies
> RUN pip install rastervision_pipeline==0.13.1 --no-dependencies
> RUN pip install rastervision_aws_s3==0.13.1 --no-dependencies
> RUN pip install rastervision_aws_batch==0.13.1 --no-dependencies
> RUN pip install rastervision_core==0.13.1 --no-dependencies
> RUN pip install rastervision_pytorch_learner==0.13.1 --no-dependencies
> RUN pip install rastervision_pytorch_backend==0.13.1 --no-dependencies
> RUN pip install rastervision_gdal_vsi==0.13.1 --no-dependencies
>
> RUN pip install pandas \
>     pip install geopandas
>
> CMD ["bash"]

I had to RUN each rastervision installation with the --no-dependencies option at the end so that pip would not update some packages, e.g. torch.

rastervision_pytorch_learner/requirements.txt changed

> rastervision_pipeline==0.13.1
> rastervision_core==0.13.1
> numpy<1.17
> pillow==5.0.*
> ~~torch==1.7.*~~
> ~~torchvision==0.8.*~~
> ~~tensorboard==1.15.*~~
> albumentations==0.5.*
> cython==0.28.*
> pycocotools==2.0.*
> future==0.18.*
> psutil==5.8.*

rastervision_core/requirements.txt changed

> rastervision_pipeline==0.13.1
> numpy<1.17
> shapely==1.6.*
> pillow==5.0.*
> pyproj==2.6.*
> ~~rasterio==1.0.7~~
> scikit-learn==0.19.*
> imageio==2.3.*
> pystac==0.5.2
> supermercado==0.0.*
> mask-to-polygons==0.0.2

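The struck-out lines above amount to dropping a few pins before the requirements reach pip. As a rough sketch (not something from the Raster Vision repo), the same filtering could be done programmatically. The `SKIP` set and both function names are hypothetical and only mirror the packages struck out in this post.

```python
import re

# Hypothetical set: the packages whose pins were struck out in the post above.
SKIP = {"torch", "torchvision", "tensorboard", "rasterio"}


def requirement_name(line):
    """Return the lowercase package name at the start of a requirement line."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", line)
    return match.group(1).lower() if match else None


def filter_requirements(lines, skip=SKIP):
    """Keep requirement lines whose package name is not in `skip`."""
    kept = []
    for line in lines:
        stripped = line.strip()
        # Ignore blank lines and comments.
        if not stripped or stripped.startswith("#"):
            continue
        if requirement_name(stripped) in skip:
            continue
        kept.append(stripped)
    return kept
```

For example, `filter_requirements(open("requirements.txt"))` would return the lines that are still safe to pass to `pip install`, leaving the torch build shipped with the base image untouched.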
AdeelH commented 3 years ago

According to this stackoverflow answer, the error might confusingly occur on running out of GPU memory. Can you check whether this is the case for you?
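One way to check this, sketched here under the assumption that `nvidia-smi` is available inside the container, is to poll GPU memory while training runs. The helper names are hypothetical, and the parser assumes plain integer MiB values in the CSV output.

```python
import shutil
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' rows (MiB) from nvidia-smi's CSV output."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append({"used_mib": used, "total_mib": total,
                     "free_mib": total - used})
    return rows


def query_gpu_memory():
    """Return per-GPU memory usage, or [] when nvidia-smi is not on PATH."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out)
```

Calling `query_gpu_memory()` from a second shell in the container shortly before the failure point (around 20% of the first epoch in the traceback) would show whether free memory is close to zero.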

marcioluish commented 3 years ago

> According to this stackoverflow answer, the error might confusingly occur on running out of GPU memory. Can you check whether this is the case for you?

Hi @AdeelH!

This isn't the case. Memory is fine, and I've already tried changing the values of the torch.backends flags.

E.g.: torch.backends.cudnn.benchmark and so on.

Could you verify if I'm installing Rastervision correctly in the Dockerfile above?
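For readers following along, the kind of flag change described above might look like the sketch below. `apply_cudnn_flags` is a hypothetical helper; it only sets the `enabled`, `benchmark` and `deterministic` attributes of `torch.backends.cudnn`, and it takes the torch module as an argument so it can be exercised without a GPU.

```python
def apply_cudnn_flags(torch_module, enabled=True, benchmark=False,
                      deterministic=False):
    """Set cuDNN-related flags on the given torch module and return them."""
    cudnn = torch_module.backends.cudnn
    cudnn.enabled = enabled              # False disables cuDNN entirely
    cudnn.benchmark = benchmark          # True lets cuDNN autotune conv algorithms
    cudnn.deterministic = deterministic  # True restricts to deterministic algorithms
    return {
        "enabled": cudnn.enabled,
        "benchmark": cudnn.benchmark,
        "deterministic": cudnn.deterministic,
    }
```

Typical usage would be `import torch; apply_cudnn_flags(torch, benchmark=False)` before the model is built. Setting `enabled=False` is a useful diagnostic: if training then runs (more slowly), the problem is isolated to cuDNN rather than to CUDA in general.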

AdeelH commented 3 years ago

> IMPORTANT: When I run conv functions directly via pytorch, it doesn't return any error

Just to confirm, are you doing this inside the Docker container?

marcioluish commented 3 years ago

> > IMPORTANT: When I run conv functions directly via pytorch, it doesn't return any error
>
> Just to confirm, are you doing this inside the Docker container?

Yes, with the container in direct communication with the HPC.

AdeelH commented 3 years ago

RV doesn't make any low-level changes to TorchVision models, so it is not clear to me why this should happen only when using RV.

To get to the bottom of this, I would suggest running those same TorchVision models outside of RV (but still inside the Docker container). E.g. if you're running into this error when doing chip classification, you should try running a ResNet-50. You can find out which models RV uses by looking at the rastervision/pytorch_learner/*_learner.py files, but feel free to ask for clarifications.
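A minimal sketch of such a standalone check, using a randomly initialised ResNet-50 as the example model: the function name, batch size and image size are assumptions for illustration and are not taken from RV's learner code.

```python
import importlib.util


def run_resnet50_step(batch_size=2, size=224):
    """Run one forward/backward pass of a torchvision ResNet-50.

    Returns the output shape as a tuple, or None if torch or torchvision
    is not installed in this environment.
    """
    if (importlib.util.find_spec("torch") is None
            or importlib.util.find_spec("torchvision") is None):
        return None

    import torch
    import torchvision

    device = "cuda" if torch.cuda.is_available() else "cpu"
    print("torch", torch.__version__, "| torchvision", torchvision.__version__)
    print("device:", device, "| cuDNN:", torch.backends.cudnn.version())

    model = torchvision.models.resnet50().to(device)  # random weights, no download
    model.train()
    x = torch.randn(batch_size, 3, size, size, device=device)
    out = model(x)        # forward pass through the conv layers
    out.sum().backward()  # backward pass exercises the conv kernels too
    return tuple(out.shape)
```

Running `run_resnet50_step()` inside the same container, ideally with the batch size and chip size from the RV config, should either reproduce the cuDNN error without RV in the loop or show that plain TorchVision works on this hardware.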

marcioluish commented 3 years ago

@AdeelH We've made progress using an image with PyTorch 1.9 for our hardware. Now rastervision is working!

Issue can be closed. Cheers.

AdeelH commented 3 years ago

Glad it worked out. Please feel free to open new issues if you have any other questions or suggestions!