Closed marcioluish closed 3 years ago
According to this stackoverflow answer, the error might confusingly occur on running out of GPU memory. Can you check whether this is the case for you?
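One way to check this from inside the training process is to ask PyTorch's CUDA allocator directly. The following is only a minimal sketch: `gpu_memory_report` is a hypothetical helper name, and it reports memory held by the current process only, so usage by other jobs on the same GPU would still need to be checked with `nvidia-smi`.

```python
def gpu_memory_report(device_index=0):
    """Return memory figures (in MiB) for one CUDA device.

    Returns None if torch is not importable or no CUDA device is visible.
    Hypothetical helper, not part of Raster Vision or PyTorch.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    mib = 1024 ** 2
    props = torch.cuda.get_device_properties(device_index)
    return {
        "device": props.name,
        # Physical memory on the card.
        "total_mib": props.total_memory / mib,
        # Memory currently occupied by tensors of this process.
        "allocated_mib": torch.cuda.memory_allocated(device_index) / mib,
        # Memory held by PyTorch's caching allocator for this process.
        "reserved_mib": torch.cuda.memory_reserved(device_index) / mib,
        # High-water mark of tensor memory since the process started.
        "peak_allocated_mib": torch.cuda.max_memory_allocated(device_index) / mib,
    }


print(gpu_memory_report())
```

Calling it right before the training step that fails, and comparing `peak_allocated_mib` against `total_mib`, should give a rough idea of whether the run is close to the memory limit.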
Hi @AdeelH!
This isn't the case. Memory is fine, and I've already tried changing the values of the torch.backends flags,
e.g. torch.backends.cudnn.benchmark and so on.
Could you verify if I'm installing Rastervision correctly in the Dockerfile above?
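For reference, the cuDNN flags mentioned above can be inspected and toggled around a single convolution, roughly as sketched below. The helper names `cudnn_report` and `try_conv` and the tensor sizes are assumptions made for illustration; only the `torch.backends.cudnn` attributes themselves come from PyTorch.

```python
def cudnn_report():
    """Collect the version and flag values relevant to cuDNN errors."""
    try:
        import torch
    except ImportError:
        return {"torch": None}
    return {
        "torch": torch.__version__,
        "cuda": torch.version.cuda,
        "cudnn": torch.backends.cudnn.version(),
        "cuda_available": torch.cuda.is_available(),
        "cudnn_enabled": torch.backends.cudnn.enabled,
        "cudnn_benchmark": torch.backends.cudnn.benchmark,
        "cudnn_deterministic": torch.backends.cudnn.deterministic,
    }


def try_conv(benchmark, enabled=True):
    """Run one conv2d forward/backward pass on the GPU with the given flags.

    Returns "ok", a "skipped: ..." string, or the RuntimeError message.
    """
    try:
        import torch
    except ImportError:
        return "skipped: torch not importable"
    if not torch.cuda.is_available():
        return "skipped: no CUDA device"

    # Remember the current flag values so they can be restored afterwards.
    old = (torch.backends.cudnn.enabled, torch.backends.cudnn.benchmark)
    torch.backends.cudnn.enabled = enabled
    torch.backends.cudnn.benchmark = benchmark
    try:
        # Same shape as the first convolution of a ResNet (arbitrary batch size).
        conv = torch.nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3).cuda()
        x = torch.randn(8, 3, 256, 256, device="cuda")
        conv(x).sum().backward()
        torch.cuda.synchronize()
        return "ok"
    except RuntimeError as err:
        return f"RuntimeError: {err}"
    finally:
        torch.backends.cudnn.enabled, torch.backends.cudnn.benchmark = old


print(cudnn_report())
for benchmark in (False, True):
    print("benchmark =", benchmark, "->", try_conv(benchmark))
print("cudnn disabled ->", try_conv(benchmark=False, enabled=False))
```

If the convolution only succeeds with cuDNN disabled, that would point at the cuDNN build in the image rather than at the training code.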
IMPORTANT: When I run conv functions directly via pytorch, it doesn't return any error
Just to confirm, are you doing this inside the Docker container?
Yes, with the container in direct communication with the HPC.
RV doesn't make any low-level changes to TorchVision models, so it is not clear to me why this should happen only when using RV.
To get to the bottom of this, I would suggest running those same TorchVision models outside of RV (but still inside the Docker container). E.g. if you're running into this error when doing chip classification, you should try running a ResNet-50. You can find out which models RV uses by looking at the rastervision/pytorch_learner/*_learner.py files, but feel free to ask for clarifications.
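A standalone check along those lines could look like the sketch below, which runs a few training steps of a TorchVision ResNet-50 on random data. The function name, batch size, image size and optimizer settings are arbitrary choices for illustration, not the values RV uses.

```python
def resnet50_smoke_test(batch_size=8, image_size=224, steps=3):
    """Run a few forward/backward passes of ResNet-50 on random data.

    Returns "ok" or a "skipped: ..." string. Any RuntimeError (such as the
    cuDNN one) is left to propagate so that the full traceback is visible.
    """
    try:
        import torch
        import torchvision
    except ImportError:
        return "skipped: torch/torchvision not importable"
    if not torch.cuda.is_available():
        return "skipped: no CUDA device"

    device = torch.device("cuda")
    # Randomly initialized weights, so nothing is downloaded.
    model = torchvision.models.resnet50(num_classes=2).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    model.train()
    for _ in range(steps):
        x = torch.randn(batch_size, 3, image_size, image_size, device=device)
        y = torch.randint(0, 2, (batch_size,), device=device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    # Wait for the queued GPU work so that errors surface here.
    torch.cuda.synchronize()
    return "ok"


print(resnet50_smoke_test())
```

If this fails inside the container in the same way, the problem lies in the PyTorch/cuDNN/hardware combination rather than in RV; if it passes, trying the batch and chip sizes from the RV config would be the next step.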
@AdeelH We've made progress using an image with PyTorch 1.9 for our hardware. Now rastervision is working!
Issue can be closed. Cheers.
Glad it worked out. Please feel free to open new issues if you have any other questions or suggestions!
❓ Questions and Help
Hi,
I'm trying to integrate rastervision with an HPC machine and run it in a container. To do that, I have to use a specific Nvidia pytorch image (https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_21-02.html#rel_21-02), which is compatible with my hardware.
As you can see in the link above, it uses:
However, when I try to train a model by running 'rastervision run', I get the following error with cuDNN:
> File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 390, in _conv_forward
>     return F.conv2d(input, weight, bias, self.stride,
> RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
IMPORTANT: When I run conv functions directly via pytorch, it doesn't return any error, so I believe the error lies in the integration between the torch version that I'm running and rastervision.
Full error traceback:
> Training [#######-----------------------------] 20%
> Traceback (most recent call last):
>   File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
>     return _run_code(code, main_globals, None,
>   File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
>     exec(code, run_globals)
>   File "/opt/src/rastervision_pipeline/rastervision/pipeline/cli.py", line 248, in

My DockerFile:
> FROM nvcr.io/nvidia/pytorch:21.02-py3
>
> ENV DEBIAN_FRONTEND=noninteractive
> ARG DEBIAN_FRONTEND=noninteractive
>
> WORKDIR /workspace
>
> RUN apt-get update
>
> RUN apt install libprotobuf-dev protobuf-compiler -y
>
> RUN apt-get update && apt-get install -y software-properties-common \
>     && rm -rf /var/lib/apt/lists/* \
>     && add-apt-repository "deb http://security.ubuntu.com/ubuntu xenial-security main" \
>     && apt-get update && apt-get install -y \
>     build-essential \
>     libsm6 \
>     libxext6 \
>     libfontconfig1 \
>     libxrender1 \
>     libswscale-dev \
>     libtbb2 \
>     libtbb-dev \
>     libjpeg-dev \
>     libpng-dev \
>     libtiff-dev \
>     libjasper-dev \
>     libavformat-dev \
>     libpq-dev \
>     libturbojpeg \
>     git \
>     libgl1-mesa-glx \
>     ffmpeg \
>     && apt-get clean \
>     && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
>
> ENV CURL_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
>
> RUN pip install lvis
>
> RUN apt-get install autoconf automake libtool curl make g++ unzip -y
>
> //Protoc
> RUN curl -OL https://github.com/google/protobuf/releases/download/v3.5.1/protoc-3.5.1-linux-x86_64.zip
> RUN unzip protoc-3.5.1-linux-x86_64.zip -d protoc3
> RUN mv protoc3/bin/* /usr/local/bin/
> RUN mv protoc3/include/* /usr/local/include/
>
> RUN pip install -U catalyst
>
> //Install gdal
> RUN conda install -y -c conda-forge gdal=3.0.4
> //Setup GDAL_DATA directory, rasterio needs it.
> ENV GDAL_DATA=/opt/conda/lib/python3.8/site-packages/rasterio/gdal_data/
>
> WORKDIR /opt/src/
>
> COPY ./requirements-dev.txt /opt/src/requirements-dev.txt
> RUN pip install -r requirements-dev.txt
>
> //Install requirements for each package.
> COPY ./rastervision_pipeline/requirements.txt /opt/src/requirements.txt
> RUN pip install $(grep -ivE "rastervision_*" requirements.txt)
>
> COPY ./rastervision_aws_s3/requirements.txt /opt/src/requirements.txt
> RUN pip install $(grep -ivE "rastervision_*" requirements.txt)
>
> COPY ./rastervision_aws_batch/requirements.txt /opt/src/requirements.txt
> RUN pip install $(grep -ivE "rastervision_*" requirements.txt)
>
> COPY ./rastervision_pytorch_learner/requirements.txt /opt/src/requirements.txt
> RUN pip install $(grep -ivE "rastervision_*" requirements.txt)
>
> COPY ./rastervision_gdal_vsi/requirements.txt /opt/src/requirements.txt
> RUN pip install $(grep -ivE "rastervision_*" requirements.txt)
>
> COPY ./rastervision_core/requirements.txt /opt/src/requirements.txt
> RUN pip install $(grep -ivE "rastervision_*" requirements.txt)
>
> //Install docs/requirements.txt
> COPY ./docs/requirements.txt /opt/src/docs/requirements.txt
> RUN pip install -r docs/requirements.txt
>
> COPY scripts /opt/src/scripts/
> COPY scripts/rastervision /usr/local/bin/rastervision
> COPY tests /opt/src/tests/
> COPY integration_tests /opt/src/integration_tests/
> COPY .flake8 /opt/src/.flake8
> COPY .coveragerc /opt/src/.coveragerc
>
> //Needed for click to work
> ENV LC_ALL C.UTF-8
> ENV LANG C.UTF-8
> ENV PROJ_LIB /opt/conda/share/proj/
>
> //Copy code for each package.
> ENV PYTHONPATH=/opt/src:$PYTHONPATH
> ENV PYTHONPATH=/opt/src/rastervision_pipeline/:$PYTHONPATH
> ENV PYTHONPATH=/opt/src/rastervision_aws_s3/:$PYTHONPATH
> ENV PYTHONPATH=/opt/src/rastervision_aws_batch/:$PYTHONPATH
> ENV PYTHONPATH=/opt/src/rastervision_gdal_vsi/:$PYTHONPATH
> ENV PYTHONPATH=/opt/src/rastervision_core/:$PYTHONPATH
> ENV PYTHONPATH=/opt/src/rastervision_pytorch_learner/:$PYTHONPATH
> ENV PYTHONPATH=/opt/src/rastervision_pytorch_backend/:$PYTHONPATH
>
> COPY ./rastervision_pipeline/ /opt/src/rastervision_pipeline/
> COPY ./rastervision_aws_s3/ /opt/src/rastervision_aws_s3/
> COPY ./rastervision_aws_batch/ /opt/src/rastervision_aws_batch/
> COPY ./rastervision_core/ /opt/src/rastervision_core/
> COPY ./rastervision_pytorch_learner/ /opt/src/rastervision_pytorch_learner/
> COPY ./rastervision_pytorch_backend/ /opt/src/rastervision_pytorch_backend/
> COPY ./rastervision_gdal_vsi/ /opt/src/rastervision_gdal_vsi/
>
> RUN pip install rastervision==0.13.1 --no-dependencies
> RUN pip install rastervision_pipeline==0.13.1 --no-dependencies
> RUN pip install rastervision_aws_s3==0.13.1 --no-dependencies
> RUN pip install rastervision_aws_batch==0.13.1 --no-dependencies
> RUN pip install rastervision_core==0.13.1 --no-dependencies
> RUN pip install rastervision_pytorch_learner==0.13.1 --no-dependencies
> RUN pip install rastervision_pytorch_backend==0.13.1 --no-dependencies
> RUN pip install rastervision_gdal_vsi==0.13.1 --no-dependencies
>
> RUN pip install pandas \
>     pip install geopandas
>
> CMD ["bash"]

I had to RUN each rastervision installation with the --no-dependencies option at the end so that pip would not update some packages, e.g. torch.
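Since the image is built with --no-dependencies and edited requirements files, it may help to confirm which versions actually ended up installed in the container. A minimal sketch using only the standard library (available from Python 3.8, which the image above uses); `installed_versions` is a hypothetical helper and the package list is just an example:

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Example list: the packages whose pins were struck out, plus RV itself.
packages = ["torch", "torchvision", "tensorboard", "rasterio", "rastervision"]
for name, version in installed_versions(packages).items():
    print(f"{name}: {version if version is not None else 'not installed'}")
```

Comparing the reported torch and torchvision versions against the ones listed in the NVIDIA release notes would show whether any later pip step replaced the builds shipped with the image.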
rastervision_pytorch_learner/requirements.txt changed
> rastervision_pipeline==0.13.1
> rastervision_core==0.13.1
> numpy<1.17
> pillow==5.0.*
> ~~torch==1.7.*~~
> ~~torchvision==0.8.*~~
> ~~tensorboard==1.15.*~~
> albumentations==0.5.*
> cython==0.28.*
> pycocotools==2.0.*
> future==0.18.*
> psutil==5.8.*

rastervision_core/requirements.txt changed

> rastervision_pipeline==0.13.1
> numpy<1.17
> shapely==1.6.*
> pillow==5.0.*
> pyproj==2.6.*
> ~~rasterio==1.0.7~~
> scikit-learn==0.19.*
> imageio==2.3.*
> pystac==0.5.2
> supermercado==0.0.*
> mask-to-polygons==0.0.2