DeepGraphLearning / NBFNet

Official implementation of Neural Bellman-Ford Networks (NeurIPS 2021)
MIT License

[Feature Request] `Dockerfile` / `environment.yml` for better reproducibility #1

Closed SauravMaheshkar closed 1 year ago

SauravMaheshkar commented 2 years ago

Congratulations to the authors on NeurIPS'21; looking forward to your talk during LoGaG.


While installing the project on VMs and local systems, I've run into multiple issues getting the correct package versions installed, be it CUDA errors while installing torch-scatter and torchdrug, or pybind11 issues. Having a Dockerfile would help prevent such errors and make reproducibility and experimentation easier.

I think it'd be better to have a Docker image for torchdrug itself, so that the NBFNet image could simply use it as the base image. More than happy to take this up.

This way one could also use the NVIDIA Container Toolkit to run experiments across multiple GPUs/nodes easily.
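For context, launching a container on a GPU host with the NVIDIA Container Toolkit is a one-liner. The image tag `nbfnet`, the mount path, and the training command below are placeholders, not the repository's actual entry point:

```shell
# Assumes the NVIDIA Container Toolkit is installed and an image
# tagged "nbfnet" has been built from a Dockerfile like the ones
# discussed in this thread. Paths and the run command are illustrative.
docker run --rm --gpus all \
    -v "$(pwd)/datasets:/code/datasets" \
    nbfnet \
    python script/run.py -c config.yaml
```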

KiddoZhu commented 2 years ago

Hi! That's a great suggestion! We can add an environment.yml for NBFNet soon. As for the Docker image, we are not so familiar with the steps, so it will probably take some time to figure out. As you said, a Docker image makes it easy to launch experiments across multiple nodes, so we will definitely add one for torchdrug.
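For reference, a minimal environment.yml could look like the sketch below. The version pins mirror the pytorch/pytorch:1.8.1-cuda11.1 image used later in this thread; they are assumptions, not an official pin list, and the real file would likely list additional NBFNet dependencies:

```yaml
# Hypothetical environment.yml sketch; versions are assumptions
# mirroring the base images discussed in this thread.
name: nbfnet
channels:
  - pytorch
  - conda-forge
dependencies:
  - python=3.8
  - pytorch=1.8.1
  - cudatoolkit=11.1
  - pip
  - pip:
      # pip options can be given as their own list entries
      - --find-links https://pytorch-geometric.com/whl/torch-1.8.1+cu111.html
      - torch-scatter
      - torchdrug
```

Created with `conda env create -f environment.yml` and activated with `conda activate nbfnet`.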

SauravMaheshkar commented 2 years ago

I just built this Dockerfile over on my fork of the repository.

# syntax=docker/dockerfile:1.2
# To build the image use :-
# $ DOCKER_BUILDKIT=1 docker build .
FROM pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime

# metainformation
LABEL version="0.0.1"
LABEL maintainer="Saurav Maheshkar"

# Helpers
ARG DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1

WORKDIR /code
COPY . .

RUN pip3 install --no-cache-dir --upgrade pip setuptools wheel
RUN pip3 install --no-cache-dir torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
RUN pip3 install --no-cache-dir torchdrug
RUN pip3 install --no-cache-dir -r requirements.txt

RUN find /opt/conda/lib/ -follow -type f -name '*.a' -delete \
    && find /opt/conda/lib/ -follow -type f -name '*.pyc' -delete \
    && find /opt/conda/lib/ -follow -type f -name '*.txt' -delete \
    && find /opt/conda/lib/ -follow -type f -name '*.mc' -delete \
    && find /opt/conda/lib/ -follow -type f -name '*.js.map' -delete \
    && find /opt/conda/lib/ -name '*.c' -delete \
    && find /opt/conda/lib/ -name '*.pxd' -delete \
    && find /opt/conda/lib/ -follow -type f -name '*.md' -delete \
    && find /opt/conda/lib/ -follow -type f -name '*.png' -delete \
    && find /opt/conda/lib/ -follow -type f -name '*.jpg' -delete \
    && find /opt/conda/lib/ -follow -type f -name '*.jpeg' -delete \
    && find /opt/conda/lib/ -name '*.pyd' -delete \
    && find /opt/conda/lib/ -name '__pycache__' | xargs rm -r

ENV PATH /opt/conda/bin:$PATH

Thoughts on this @KiddoZhu ?

KiddoZhu commented 2 years ago

Thanks for the recipe. I just learned some basics of Docker. It looks like I can't import torchdrug correctly with this Dockerfile: it says libXrender.so is missing, which is required by rdkit. Besides, the JIT compilation used in torchdrug relies on nvcc, so I guess we need a devel variant of the PyTorch image. I will figure it out.

KiddoZhu commented 2 years ago

Here is my Dockerfile for torchdrug. We have to use the devel variant of the PyTorch image to get nvcc for the JIT compilation in torchdrug (which NBFNet also requires).

FROM pytorch/pytorch:1.8.1-cuda11.1-cudnn8-devel

RUN apt-get update && \
    apt-get install -y libxrender1 && \
    rm -rf /var/lib/apt/lists/*

RUN pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.8.1+cu111.html  && \
    pip install torchdrug

I am not familiar with how to prune the size of the image. Your find ... -delete commands look a little unsafe to me. Have you tested them? @SauravMaheshkar
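One low-risk way to vet such patterns is to run each `find` with `-print` first and only switch to `-delete` once the listing looks right. A tiny self-contained demonstration in a scratch directory (file names are illustrative):

```shell
# Demonstrate previewing a cleanup pattern before deleting.
tmp=$(mktemp -d)
touch "$tmp/keep.py" "$tmp/junk.pyc"

# Dry run: -print lists exactly what -delete would remove.
find "$tmp" -type f -name '*.pyc' -print

# Same pattern with -delete; keep.py is untouched.
find "$tmp" -type f -name '*.pyc' -delete
ls "$tmp"    # -> keep.py

rm -rf "$tmp"
```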

SauravMaheshkar commented 2 years ago

Yes, I have tested them, but in my experience they don't contribute much towards decreasing image size. It might be better to use a multi-stage build, maybe something like:

# Builder Image
FROM pytorch/pytorch:1.8.1-cuda11.1-cudnn8-devel AS builder

....

# Runner Image
FROM pytorch/pytorch:1.8.1-cuda11.1-cudnn8-devel AS runner

...

The find ... -delete commands only reduce the image size by 5-15 MB. It might be better to use docker dive to inspect which layers contribute the most size.

There's a hyper-optimized Dockerfile I work with, which can be found here.
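Filled in, such a multi-stage build might look like the sketch below: compile the CUDA extensions in the devel image, then copy only the installed packages into a smaller runtime image. The stage layout and the site-packages path are assumptions, not a tested build:

```dockerfile
# Hypothetical multi-stage sketch. Untested; adjust image tags and the
# Python version in the site-packages path to match the base image.
FROM pytorch/pytorch:1.8.1-cuda11.1-cudnn8-devel AS builder
RUN pip install --no-cache-dir \
        torch-scatter -f https://pytorch-geometric.com/whl/torch-1.8.1+cu111.html && \
    pip install --no-cache-dir torchdrug

# Runtime stage: the smaller -runtime base plus the packages built above.
# Caveat: torchdrug's JIT still needs nvcc when kernels are compiled at
# run time, so dropping the devel base may not work for NBFNet.
FROM pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime AS runner
RUN apt-get update && \
    apt-get install -y --no-install-recommends libxrender1 && \
    rm -rf /var/lib/apt/lists/*
COPY --from=builder /opt/conda/lib/python3.8/site-packages \
                    /opt/conda/lib/python3.8/site-packages
ENV PATH=/opt/conda/bin:$PATH
```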

SauravMaheshkar commented 2 years ago

Any updates? @KiddoZhu

SauravMaheshkar commented 2 years ago

Might I also suggest adding ENV PATH /opt/conda/bin:$PATH at the end of the Dockerfile? The PyTorch Docker image uses conda to manage the Python interpreter, and without this addition the underlying libraries aren't accessible by default.