Closed: StanHatko closed this issue 5 months ago.
@blairdrummond
Here is the list of URLs to be mirrored. An alternative method is to simply wget these into the proper directory. The URLs are:
@StanHatko I have another idea which might be interesting; I could see us having a MinIO bucket or something within the cluster specifically for storing/caching files like this, so that downloads would be very fast.
Typically it's good to keep the docker images small, as that affects boot time and other things, but maybe an in-cluster mirror would be useful?
@brendangadd you have any thoughts on this kind of caching?
That could work. One disadvantage is that the URLs are hardcoded in the PyTorch packages (see for example https://github.com/pytorch/vision/blob/main/torchvision/models/densenet.py near the top). So users would have to configure downloads of pretrained models manually, instead of using the package as they would on a regular computer, where a requested pretrained model is downloaded automatically if it is not found on disk. That is, unless there is a way to override those URLs, or to intercept such requests and redirect them to the Artifactory or something else.
@brendangadd if we want to try some crazy stuff, I think EnvoyFilters can do interception at that level
https://istio.io/latest/docs/reference/config/networking/envoy-filter/
From https://pytorch.org/vision/master/models.html, the TORCH_HOME environment variable can be set to specify the cache directory:
Instancing a pre-trained model will download its weights to a cache directory. This directory can be set using the TORCH_HOME environment variable. See torch.hub.load_state_dict_from_url() for details.
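As a quick illustration of the TORCH_HOME approach (the shared storage path below is hypothetical, not an actual AAW mount):

```shell
# Point the PyTorch cache at shared read-only storage; the path is an
# assumption for illustration, not a real AAW path.
export TORCH_HOME=/shared/pretrained/torch
# torch.hub then looks for downloaded weights under $TORCH_HOME/hub/checkpoints/
CHECKPOINT_DIR="$TORCH_HOME/hub/checkpoints"
echo "$CHECKPOINT_DIR"
```

If the weight files are pre-populated in that directory, torchvision finds them in the cache and never attempts a network download.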
From https://pytorch.org/docs/stable/hub.html#torch.hub.load_state_dict_from_url, the function includes a parameter to remap storage locations:
map_location (optional) – a function or a dict specifying how to remap storage locations (see torch.load)
The documentation for torch.load says:
If map_location is a callable, it will be called once for each serialized storage with two arguments: storage and location. The storage argument will be the initial deserialization of the storage, residing on the CPU. Each serialized storage has a location tag associated with it which identifies the device it was saved from, and this tag is the second argument passed to map_location. The builtin location tags are 'cpu' for CPU tensors and 'cuda:device_id' (e.g. 'cuda:2') for CUDA tensors. map_location should return either None or a storage. If map_location returns a storage, it will be used as the final deserialized object, already moved to the right device. Otherwise, torch.load() will fall back to the default behavior, as if map_location wasn’t specified.
If TORCH_HOME can be pointed at fast read-only SSD storage accessible from all nodes, it might do the job.
Anyway, the problem with these solutions (creating fast read-only storage accessible from all nodes and pointing TORCH_HOME there, or intercepting URLs with EnvoyFilters) is that only the AAW administrators can implement them; I cannot do it myself.
If we are OK with the AAW being different from home computers in this regard, the simplest solution may be to mirror these pretrained model URLs in the Artifactory and clearly document that for the pretrained models (both torchvision and others like word embeddings). Then we could have a small script in the image to download requested pretrained models, for example:
download-torchvision-model.sh resnet18
This download-torchvision-model.sh script would pull the requested torchvision model from the Artifactory and save it in the correct directory that torchvision checks.
To avoid greatly increasing the image size and build times, I think the Artifactory approach is better. Please mirror the pretrained model URLs above in the Artifactory. Once that's done, I can create the small download-torchvision-model.sh script described above, add it to the image, and add it to the documentation.
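A minimal sketch of what such a script could look like. The Artifactory base URL, the folder layout, and the weight-file names are all assumptions for illustration, not the real values:

```shell
#!/bin/sh
# download-torchvision-model.sh (sketch): fetch a pretrained torchvision
# model from the Artifactory mirror into the cache directory torchvision checks.
ARTIFACTORY_BASE="${ARTIFACTORY_BASE:-https://artifactory.example.ca/pretrained-packages}"
CACHE_DIR="${TORCH_HOME:-$HOME/.cache/torch}/hub/checkpoints"

weight_file() {
  # Map a torchvision model name to its weight file (file names hypothetical).
  case "$1" in
    resnet18)    echo "resnet18.pth" ;;
    densenet121) echo "densenet121.pth" ;;
    *) echo "unsupported model: $1" >&2; return 1 ;;
  esac
}

download_model() {
  f="$(weight_file "$1")" || return 1
  mkdir -p "$CACHE_DIR"
  wget -q -O "$CACHE_DIR/$f" "$ARTIFACTORY_BASE/torchvision/$f"
}

# Example invocation (requires the mirror to exist):
#   download_model resnet18
```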
@bryanpaget this is another possibly cool idea, as it makes pre-trained models available for protected-b notebooks
Another source of pretrained model weights we discussed yesterday was https://huggingface.co/models, also with a predefined list of acceptable models.
We just need to gather a list of URLs to mirror in addition to the ones above (the huggingface.co site has 903 pages of models, but we can at least mirror the most common and important ones). I'll post some additional word embedding URLs below.
An Artifactory administrator simply needs to add these URLs to Artifactory. Once I have the URLs, I can make the small script mentioned above; a better interface may be ./download-pretrained-model.sh torch-resnet18 or ./download-pretrained-model.sh fasttext-cc-fr. In the future, if there's a way to intercept URL downloads on AAW and redirect them to Artifactory, that would be even better, but for now ./download-pretrained-model.sh should be good enough.
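The prefixed interface above could dispatch on the model-name prefix like this (the prefixes and source families shown are assumptions, not a final interface):

```shell
# Sketch of the prefix dispatch for download-pretrained-model.sh:
# split "torch-resnet18" into a source family and a model name.
parse_source() {
  case "$1" in
    torch-*)    echo "torchvision ${1#torch-}" ;;
    fasttext-*) echo "fasttext ${1#fasttext-}" ;;
    glove-*)    echo "glove ${1#glove-}" ;;
    *) echo "unknown source: $1" >&2; return 1 ;;
  esac
}

parse_source torch-resnet18
```

Each family would then map to its own folder in the Artifactory mirror and its own cache directory on disk.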
The following URLs are from existing GitLab issues.
From https://fasttext.cc/docs/en/english-vectors.html (contains information and the reference paper to cite if publishing a paper based on these):
From https://fasttext.cc/docs/en/crawl-vectors.html (contains information and the reference paper to cite if publishing a paper based on these):
From https://fasttext.cc/docs/en/aligned-vectors.html (contains information and the reference papers to cite if publishing a paper based on these):
FastText for language detection https://fasttext.cc/docs/en/language-identification.html needs the following:
Here are the URLs to mirror for the pretrained GloVe embeddings:
Various models from HuggingFace that people previously requested (click on "Files and Versions" to see the actual files; since that tab shows a git repo, there's probably a good way to git clone them):
Yes, it looks like the HuggingFace downloads can be done programmatically with git + git lfs:
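For example, each model page corresponds to a git repository at https://huggingface.co/<model-id>, so cloning is just (the model id below is one of the commonly requested ones, used for illustration):

```shell
# Build the git clone URL for a HuggingFace model repo.
hf_clone_url() {
  echo "https://huggingface.co/$1"
}

# Usage (requires network access and git-lfs installed):
#   git lfs install
#   git clone "$(hf_clone_url bert-base-uncased)"
hf_clone_url bert-base-uncased
```

The large weight files are stored in git-lfs, so `git lfs install` must be run first or the clone will only fetch small pointer files.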
This is good. Now we just need some way to contact the AAW Artifactory manager and ask them to mirror the above in a pretrained-packages folder in the Artifactory.
What is the contact information for the AAW Artifactory manager?
@blairdrummond Do you know how to contact the AAW Artifactory manager?
@StanHatko @bryanpaget I think @Jose-Matsuda can log in. We are the owners of the AAW Artifactory. Once Bryan has his accounts we'll be able to look into this
We learned today that a subset of weights from PyTorch Hub and HuggingFace's model list is already mirrored on an Artifactory instance, just not the one available on AAW. @EkramulHoque found an internal ticket about it too.
@blairdrummond we have found some pre-trained transformer models already downloaded on the Artifactory instance on Net A. Would it be possible to make a copy of these for the AAW Artifactory?
I don't see why we would need to transfer data back from the Net A Artifactory to the AAW Artifactory. Wouldn't it be easier to just add the URLs to mirror on the AAW Artifactory? Artifactory was built specifically for this job: mirroring repositories and objects.
As per discussion at today's technical elaboration (CC @bryanpaget @Jose-Matsuda): .pth files are pickles, which are unsafe to load from untrusted sources, so we need to assess the types of artifacts under discussion (.npy, .pth, .zip, etc.) and see what's up. We can investigate these and hopefully compile a list of "trusted" sources in this thread. We will talk to our Artifactory rep about this, and we may talk to the upstream folks such as PyTorch or HuggingFace.
Another site that should be mirrored is https://cdn.proj.org/, which hosts additional geographic projections and is used automatically by GDAL when it encounters a projection not saved on the system; this obviously fails on a system without internet access. That website gives mirroring instructions and says the total size of its content is 568 MB.
I found this Dockerfile https://github.com/bosborn/proj.4/blob/master/Dockerfile that mirrors from that site; specifically, it runs the following:
# Put this first as this is rarely changing
RUN \
mkdir -p /usr/share/proj; \
wget --no-verbose --mirror https://cdn.proj.org/; \
rm -f cdn.proj.org/*.js; \
rm -f cdn.proj.org/*.css; \
mv cdn.proj.org/* /usr/share/proj/; \
rmdir cdn.proj.org
With GDAL installed in a conda virtual environment, it uses /etc/share/proj as the projections directory (e.g. /etc/share/proj/us_nga_egm96_15.tif) and not /usr/share/proj (which the above example uses). /etc/share/proj is writable by the AAW user; I'm able to put projection files there, which then makes the corresponding projections usable by GDAL.
More generally (see https://proj.org/resource_files.html), on Linux it will use ${XDG_DATA_HOME}/proj if XDG_DATA_HOME is defined, else ${HOME}/.local/share/proj. For me, in the conda virtual environment with GDAL installed, XDG_DATA_HOME is /etc/share/proj.
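The lookup rule described above can be sketched as a small helper, useful for checking where a given environment will expect the projection grids to live:

```shell
# Resolve the PROJ user data directory per the Linux rules described at
# https://proj.org/resource_files.html: ${XDG_DATA_HOME}/proj if set,
# otherwise ${HOME}/.local/share/proj.
proj_user_dir() {
  if [ -n "$XDG_DATA_HOME" ]; then
    echo "$XDG_DATA_HOME/proj"
  else
    echo "$HOME/.local/share/proj"
  fi
}

proj_user_dir
```

Mirrored grid files from cdn.proj.org could then be copied (or symlinked) into whatever directory this resolves to in the notebook image.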
Great convos here, marking as stale for now. Please create another issue if needed.
I made a pull request https://github.com/StatCan/aaw-kubeflow-containers/pull/302 to add pretrained PyTorch models to the jupyterlab-pytorch image. Please let me know if you have any comments or if there are any problems that need to be fixed.