Container fragility - Githubissues

marcverhagen commented 1 year ago

Because

Not sure how to replicate this, but on one of my machines I could not do a docker-build. It would fail during the pip-install with a "Connection reset by peer" message, typically while installing torch. This went away once I split the requirements file into three files, with the first installing torch, torchvision and torchmetrics. But that is hardly a satisfactory solution.

Maybe related is that on another of my three machine I also get "Connection reset by peer" when downloading "https://download.pytorch.org/models/vgg16-397923af.pth". That download should probably be made part of the image building process.

It is also not totally clear to me what the most efficient build would be. Now we use the clams-python-opencv4-torch2 base image, but with the current requirements all of torch and cuda will be reinstalled because that base is on torch==2.0.1., resulting in an 11.4GB image. I tried to not reinstall torch/cuda, which should be possible by using torchvision==0.15, but that failed with obscure messages.

This leads me to believe that I should use the clams-python-opencv4 base image instead.

Finally, and this should probably also be its own issue in clams-python, the torch images (and probably some others as well) are larger than needed because a pip cache is kept in /root/.cache/pip, which holds 2.6GB of data and the image could therefore be much smaller. Using the following does create a much smaller image.

RUN pip install --no-cache-dir torch==2.1.0
RUN pip install --no-cache-dir torchvision==0.16.0

Done when

No response

Additional context

No response

marcverhagen commented 1 year ago

After some experimenting it looks like the following works reasonably well:

FROM ghcr.io/clamsproject/clams-python-opencv4:1.0.9

ARG CLAMS_APP_VERSION
ENV CLAMS_APP_VERSION ${CLAMS_APP_VERSION}

RUN apt-get update && apt-get install -y wget

RUN pip install --no-cache-dir torch==2.1.0
RUN pip install --no-cache-dir torchvision==0.16.0

WORKDIR /app

RUN wget https://download.pytorch.org/models/vgg16-397923af.pth
RUN mkdir /root/.cache/torch /root/.cache/torch/hub /root/.cache/torch/hub/checkpoints
RUN mv vgg16-397923af.pth /root/.cache/torch/hub/checkpoints

COPY . /app

CMD ["python3", "app.py", "--production"]

keighrim commented 1 year ago

What is the advantage of pip-installing torch/torchvision directly instead of via requirements.txt? I'm working on #30 and it now additionally needs yaml handler as a dependency. I wonder whether we should keep two (identical) files of dependency specs (as dockerfile and req.txt) if there's a clear benefit of doing so.

Also, for backbone models, I think there must be a better way to download a pth file based on the model choice of ours, instead of hard-coding vgg URL (or other better performing models) manually.

keighrim commented 1 year ago

For model download, we can do something like this;

Given a model config yaml file as https://github.com/clamsproject/app-swt-detection/blob/9d4f229c61888e43eb58aeffa8cf1885e28082bb/modeling/models/20231026-164841.config.yml#L1

in backbones.py;

...
if __name__ == "__main__":
    import sys
    # pass the model choice via CLI
    model_map[sys.argv[1]]()
    # initiating a `ExtractorModel` instance will also download the pth file and initiate the torchvision model. When this code terminates, downloaded model file should stay in the local cache dir

then in dockerfile

...
RUN python -m modeling.backbones $(grep "model_type" modeling/classifier-config.yaml | cut -d: -f2)

CMD ["python3", "app.py", "--production"]

marcverhagen commented 11 months ago

Maybe a little big ugly with the grep inside the container file and because it requires the backbones file to cater just to the container file, but we can play with that a bit. Definitely better than

RUN wget https://download.pytorch.org/models/vgg16-397923af.pth

keighrim commented 11 months ago

fixed via #48.

clamsproject / app-swt-detection

Container fragility #24

Because

Done when

Additional context