Closed: StanHatko closed this issue 5 months ago.
@blairdrummond
Here is the list of URLs to be mirrored. An alternative method is to simply wget these into the proper directory. The URLs are:
@StanHatko I have another idea which might be interesting; I could see us having a MinIO bucket or something within the cluster specifically for storing/caching files like this, so that downloads would be very fast.
Typically it's good to keep the docker images small, as that affects boot time and other things, but maybe an in-cluster mirror would be useful?
@brendangadd you have any thoughts on this kind of caching?
That could work. One disadvantage is that the URLs are hardcoded in the PyTorch packages (see for example https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py near the top). So users would have to configure downloads of pretrained models manually, instead of using the package as they would on a regular computer, where a requested pretrained model is downloaded automatically if it is not found on disk. That is, unless there is a way to override those URLs, or to intercept such requests and redirect them to the Artifactory or something else.
@brendangadd if we want to try some crazy stuff, I think EnvoyFilters can do interception at that level
https://istio.io/latest/docs/reference/config/networking/envoy-filter/
From https://pytorch.org/vision/master/models.html, the TORCH_HOME environment variable can be set to specify the cache directory:
Instancing a pre-trained model will download its weights to a cache directory. This directory can be set using the TORCH_HOME environment variable. See torch.hub.load_state_dict_from_url() for details.
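As a quick illustration of the TORCH_HOME approach (the shared storage path below is hypothetical, not an actual AAW mount):

```shell
# Point the PyTorch cache at shared read-only storage; the path is an
# assumption for illustration, not a real AAW path.
export TORCH_HOME=/shared/pretrained/torch
# torch.hub then looks for downloaded weights under $TORCH_HOME/hub/checkpoints/
CHECKPOINT_DIR="$TORCH_HOME/hub/checkpoints"
echo "$CHECKPOINT_DIR"
```

If the weight files are pre-populated in that directory, torchvision finds them in the cache and never attempts a network download.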
From https://pytorch.org/docs/stable/hub.html#torch.hub.load_state_dict_from_url, the function includes a parameter to remap storage locations:
map_location (optional) – a function or a dict specifying how to remap storage locations (see torch.load)
The documentation for torch.load says:
If map_location is a callable, it will be called once for each serialized storage with two arguments: storage and location. The storage argument will be the initial deserialization of the storage, residing on the CPU. Each serialized storage has a location tag associated with it which identifies the device it was saved from, and this tag is the second argument passed to map_location. The builtin location tags are 'cpu' for CPU tensors and 'cuda:device_id' (e.g. 'cuda:2') for CUDA tensors. map_location should return either None or a storage. If map_location returns a storage, it will be used as the final deserialized object, already moved to the right device. Otherwise, torch.load() will fall back to the default behavior, as if map_location wasn’t specified.
If TORCH_HOME can be pointed at fast read-only SSD storage accessible from all nodes, it might do the job.
Anyway, the problem with these solutions (creating fast read-only storage accessible from all nodes and pointing TORCH_HOME there, or intercepting URLs with EnvoyFilters) is that only the AAW administrators can implement them; I cannot do it myself.
If we are OK with the AAW being different from home computers in this regard, the simplest solution may be to mirror these pretrained model URLs in the Artifactory and clearly document that for the pretrained models (both torchvision and others like word embeddings). Then we could have a small script in the image to download requested pretrained models, for example:
download-torchvision-model.sh resnet18
This download-torchvision-model.sh script would pull the requested torchvision model from the Artifactory and save it in the correct directory that torchvision checks.
To avoid greatly increasing the image size and build times, I think the Artifactory approach is better. Please mirror the pretrained model URLs above in the Artifactory. Once that's done, I can create the small download-torchvision-model.sh script described above, add it to the image, and add it to the documentation.
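A minimal sketch of what such a script could look like. The Artifactory base URL, the folder layout, and the weight-file names are all assumptions for illustration, not the real values:

```shell
#!/bin/sh
# download-torchvision-model.sh (sketch): fetch a pretrained torchvision
# model from the Artifactory mirror into the cache directory torchvision checks.
ARTIFACTORY_BASE="${ARTIFACTORY_BASE:-https://artifactory.example.ca/pretrained-packages}"
CACHE_DIR="${TORCH_HOME:-$HOME/.cache/torch}/hub/checkpoints"

weight_file() {
  # Map a torchvision model name to its weight file (file names hypothetical).
  case "$1" in
    resnet18)    echo "resnet18.pth" ;;
    densenet121) echo "densenet121.pth" ;;
    *) echo "unsupported model: $1" >&2; return 1 ;;
  esac
}

download_model() {
  f="$(weight_file "$1")" || return 1
  mkdir -p "$CACHE_DIR"
  wget -q -O "$CACHE_DIR/$f" "$ARTIFACTORY_BASE/torchvision/$f"
}

# Example invocation (requires the mirror to exist):
#   download_model resnet18
```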
@bryanpaget this is another possibly cool idea, as it makes pre-trained models available for protected-b notebooks
Another source of pretrained model weights we discussed yesterday was https://huggingface.co/models, also with a predefined list of acceptable models.
We just need to gather a list of URLs to mirror in addition to the ones above (the huggingface.co site has 903 pages of models, but we can at least mirror the most common and important ones). I'll post some additional word embedding URLs below.
An Artifactory administrator simply needs to add these URLs to Artifactory. Once I have the URLs, I can make the small script mentioned above; a better interface may be ./download-pretrained-model.sh torch-resnet18 or ./download-pretrained-model.sh fasttext-cc-fr. In the future, if there's a way to intercept URL downloads on AAW and redirect them to Artifactory, that would be even better, but for now ./download-pretrained-model.sh should be good enough.
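The prefixed interface above could dispatch on the model-name prefix like this (the prefixes and source families shown are assumptions, not a final interface):

```shell
# Sketch of the prefix dispatch for download-pretrained-model.sh:
# split "torch-resnet18" into a source family and a model name.
parse_source() {
  case "$1" in
    torch-*)    echo "torchvision ${1#torch-}" ;;
    fasttext-*) echo "fasttext ${1#fasttext-}" ;;
    glove-*)    echo "glove ${1#glove-}" ;;
    *) echo "unknown source: $1" >&2; return 1 ;;
  esac
}

parse_source torch-resnet18
```

Each family would then map to its own folder in the Artifactory mirror and its own cache directory on disk.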
The following URLs are from existing GitLab issues.
From https://fasttext.cc/docs/en/english-vectors.html (contains information and the reference paper to cite if publishing a paper based on these):
From https://fasttext.cc/docs/en/crawl-vectors.html (contains information and the reference paper to cite if publishing a paper based on these):
From https://fasttext.cc/docs/en/aligned-vectors.html (contains information and the reference papers to cite if publishing a paper based on these):
FastText for language detection https://fasttext.cc/docs/en/language-identification.html needs the following:
Here are the URLs to mirror for the pretrained GloVe embeddings:
Various models from HuggingFace that people previously requested (click on "Files and Versions" to see the actual files; since that tab shows a git repo, there's probably a good way to git clone them):
Yes, it looks like the HuggingFace downloads can be done programmatically with git + git lfs:
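For example, each model page corresponds to a git repository at https://huggingface.co/<model-id>, so cloning is just (the model id below is one of the commonly requested ones, used for illustration):

```shell
# Build the git clone URL for a HuggingFace model repo.
hf_clone_url() {
  echo "https://huggingface.co/$1"
}

# Usage (requires network access and git-lfs installed):
#   git lfs install
#   git clone "$(hf_clone_url bert-base-uncased)"
hf_clone_url bert-base-uncased
```

The large weight files are stored in git-lfs, so `git lfs install` must be run first or the clone will only fetch small pointer files.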
This is good. Now we just need some way to contact the AAW Artifactory manager and ask them to mirror the above in a pretrained-packages folder in the Artifactory.
What is the contact information for the AAW Artifactory manager?
@blairdrummond Do you know how to contact the AAW Artifactory manager?
@StanHatko @bryanpaget I think @Jose-Matsuda can log in. We are the owners of the AAW Artifactory. Once Bryan has his accounts we'll be able to look into this
We learned today that a subset of weights from PyTorch Hub and HuggingFace's model list is already mirrored on an Artifactory instance, just not the one available on AAW. @EkramulHoque found an internal ticket about it too.
@blairdrummond we have found some pre-trained transformer models already downloaded on the Artifactory instance on Net A. Would it be possible to make a copy of these for the AAW Artifactory?
I don't see why we would need to transfer data back from the Net A Artifactory to the AAW Artifactory. Wouldn't it be easier to just add the URLs to mirror on the AAW Artifactory? Artifactory was built specifically for this job: mirroring repositories and objects.
As per discussion at today's technical elaboration (CC @bryanpaget @Jose-Matsuda): .pth files are pickles, which are unsafe to load from untrusted sources, so we need to assess the types of artifacts under discussion (.npy, .pth, .zip, etc.) and see what's up. We can investigate these and hopefully compile a list of "trusted" sources in this thread. We will talk to our Artifactory rep about this, and we may talk to the upstream folks such as PyTorch or HuggingFace.
Another site that should be mirrored is https://cdn.proj.org/, which hosts additional geographic projections and is used automatically by GDAL when it encounters a projection not saved on the system; this obviously fails on a system without internet access. That website gives mirroring instructions and says the total size of its content is 568 MB.
I found this Dockerfile https://github.com/bosborn/proj.4/blob/master/Dockerfile that mirrors from that site; specifically, it runs the following:
# Put this first as this is rarely changing
RUN \
mkdir -p /usr/share/proj; \
wget --no-verbose --mirror https://cdn.proj.org/; \
rm -f cdn.proj.org/*.js; \
rm -f cdn.proj.org/*.css; \
mv cdn.proj.org/* /usr/share/proj/; \
rmdir cdn.proj.org
With GDAL installed in a conda virtual environment, it uses /etc/share/proj as the projections directory (e.g. /etc/share/proj/us_nga_egm96_15.tif) and not /usr/share/proj (which the above example uses). /etc/share/proj is writable by the AAW user; I'm able to put projection files there, which then makes the corresponding projections usable by GDAL.
More generally (see https://proj.org/resource_files.html), on Linux it will use ${XDG_DATA_HOME}/proj if XDG_DATA_HOME is defined, else ${HOME}/.local/share/proj. For me, in the conda virtual environment with GDAL installed, XDG_DATA_HOME is /etc/share/proj.
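The lookup rule described above can be sketched as a small helper, useful for checking where a given environment will expect the projection grids to live:

```shell
# Resolve the PROJ user data directory per the Linux rules described at
# https://proj.org/resource_files.html: ${XDG_DATA_HOME}/proj if set,
# otherwise ${HOME}/.local/share/proj.
proj_user_dir() {
  if [ -n "$XDG_DATA_HOME" ]; then
    echo "$XDG_DATA_HOME/proj"
  else
    echo "$HOME/.local/share/proj"
  fi
}

proj_user_dir
```

Mirrored grid files from cdn.proj.org could then be copied (or symlinked) into whatever directory this resolves to in the notebook image.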
Great convos here, marking as stale for now. Please create another issue if needed.
I made a pull request https://github.com/StatCan/aaw-kubeflow-containers/pull/302 to add pretrained PyTorch models to the jupyterlab-pytorch image. Please let me know if you have any comments or if there are any problems that need to be fixed.