IBM / pytorch-large-model-support

Large Model Support in PyTorch
Apache License 2.0

Docker image for pytorch LMS 1.5? #3

Open sandias42 opened 4 years ago

sandias42 commented 4 years ago

Hi LMS folks,

I've really enjoyed using your fork (via the PyTorch 1.3 LMS Docker image) to train large models over the last few months. However, I'd like to upgrade to PyTorch 1.5 and can't find the corresponding Docker image, even though it looks like you provide a git patch for 1.5 (I don't trust myself to compile from source with the patch).

Does this docker image exist somewhere?

Thanks

jayfurmanek commented 4 years ago

Unfortunately no. We're going to be changing how we do some of this work so that it happens in a more open way. You'll have to stay tuned for that, but as of now the only method is to apply the patch and build it yourself.

Sorry!

sandias42 commented 4 years ago

Excited to hear any updates on this. I really appreciate your work on this repo!

jayfurmanek commented 4 years ago

FYI, we have built it and put it here: https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access/

(We've not built a corresponding Docker image, however.)

sandias42 commented 4 years ago

Dear jayfurmanek, Wow, cool! Thank you so much for following up on this.

Forgive my ignorance. This looks like a conda channel, is that correct? Do you expect a simple conda install pytorch=1.5.0 -c https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access/ to do the trick? Are there dependencies beyond what the above command would install?

I have never built a library with CUDA dependencies from source before, and I am perpetually worried about messing up driver compatibility or something else in a way that is either obviously borked or leads to serious performance hits down the line (hence my reliance on dockerized versions of libraries). I had a lot of trouble getting conda install to work with LMS until I found the Docker version of 1.3.

If you could help me out with a few tips on how to install in a Docker environment and I am successful in the build, I'd be happy to link to the image/Dockerfile here so others could use it too.

I also know you guys are moving in other directions with this project and I am conscious of your valuable time, though! So no pressure : )

I would typically start with a base Docker image such as the PyTorch 1.5 runtime (docker pull pytorch/pytorch:1.5-cuda10.1-cudnn7-runtime), then try uninstalling the existing PyTorch version and conda installing PyTorch from your channel (see above). But perhaps this makes env problems more likely because conda doesn't fully remove the old env? I'd like to keep the env stripped down to just PyTorch LMS and its dependencies if possible.
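For anyone wanting to try the approach just described, here is a minimal, untested sketch. It only writes a Dockerfile; it does not build or run anything. The base image tag and the channel URL are the ones quoted in this thread. The file name Dockerfile.lms, the image tag pytorch-lms:1.5, and the assumptions that the channel provides a pytorch=1.5.0 package for this platform and that the base image's PyTorch was installed with conda are mine and have not been confirmed by the maintainers.

```shell
#!/bin/sh
set -eu

# Values quoted in this thread.
BASE_IMAGE="pytorch/pytorch:1.5-cuda10.1-cudnn7-runtime"
LMS_CHANNEL="https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access/"

# Hypothetical output file name, chosen for illustration only.
DOCKERFILE="Dockerfile.lms"

# Write a Dockerfile that swaps the stock PyTorch for the channel build.
# Assumption: the channel provides pytorch=1.5.0 for this platform.
# In an unquoted here-document, "\\" is written out as a single "\".
cat > "$DOCKERFILE" <<EOF
FROM ${BASE_IMAGE}
# Remove the stock PyTorch, install from the early-access channel,
# then clear conda caches.
RUN conda uninstall -y pytorch \\
 && conda install -y -c ${LMS_CHANNEL} pytorch=1.5.0 \\
 && conda clean -afy
EOF

echo "Wrote $DOCKERFILE"
# To build afterwards (not run by this script):
#   docker build -f Dockerfile.lms -t pytorch-lms:1.5 .
```

Note that removing a package from a base image in a later layer does not shrink the image, so if a stripped-down env matters, starting from a base image without PyTorch preinstalled may be the cleaner route.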

Do you have a base image you'd recommend I try to add this to? Or, more generally, any outline of how you would approach containerizing this?

Many thanks again for your efforts on this awesome project.