MartinoMensio / spacy-universal-sentence-encoder

Google USE (Universal Sentence Encoder) for spaCy
MIT License

How to use this with a downloaded model? #9

Open jrruethe opened 4 years ago

jrruethe commented 4 years ago

Hello,

I am trying to use this inside of a Docker image. I have downloaded the model separately from here, and I have performed the pip install https://github.com/MartinoMensio/spacy-universal-sentence-encoder/releases/download/v0.3.1/en_use_lg-0.3.1.tar.gz#en_use_lg-0.3.1 as a RUN command in my Dockerfile.

I want to be able to load and use the model that I downloaded externally, but I don't know how to load it. When I use the following, it redownloads the model; I don't know how to tell it to load the one I downloaded manually and added to the Docker image:

# This downloads the model
nlp = spacy.load('en_use_lg')

I tried the following:

nlp = spacy.load("./universal-sentence-encoder-large_5.tar.gz")
# Untarred the file
nlp = spacy.load("./universal-sentence-encoder-large_5")

Both complain that it cannot find meta.json. Can you help me? Thanks!

tanghaoyu258 commented 3 years ago

I've met the same problem. The meta.json can be found in './spacy_universal_sentence_encoder/meta/'. But still, since I'm in China, it raises a urlopen error, which means it fails to download the model. Have you solved it?

MartinoMensio commented 3 years ago

Hi @jrruethe , thanks for reporting this issue. The problem you are experiencing comes from the fact that, in the current design of this library, the model is downloaded on first use. Your Dockerfile's RUN command installs the library, but the model is never loaded while building the image (that only happens later, inside the running container), so the model files are not included in the image.
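As a side note, one way to bake the model into the image would be to trigger that first use at build time with an extra Dockerfile step (an untested sketch; it downloads the model while the image is built):

RUN python -c "import spacy; spacy.load('en_use_lg')"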

When first developing this wrapper for spaCy, I did not think about this case, where you already have the model downloaded.

A workaround that should work for you is the following:

  1. Identify the full path where you downloaded the model (extracted from the tar.gz file from https://tfhub.dev/google/universal-sentence-encoder-large/5 ). Let's say it has been extracted to /Users/foo/Downloads/universal-sentence-encoder-large_5 (containing the subfolders assets and variables and the file saved_model.pb).
  2. When creating the container, use a volume mapping to provide the model: docker run -it --name use -v /Users/foo/Downloads/universal-sentence-encoder-large_5:/usr/local/lib/python3.7/site-packages/spacy_universal_sentence_encoder/models/c9fe785512ca4a1b179831acb18a0c6bfba603dd use
  3. Inside the container, check that the model is mapped correctly: ls /usr/local/lib/python3.7/site-packages/spacy_universal_sentence_encoder/models/c9fe785512ca4a1b179831acb18a0c6bfba603dd
  4. Run nlp = spacy.load('en_use_lg'); it will find the mapped files and skip the download.

The path is composed of: the site-packages folder of your Python environment, the models subfolder of the spacy_universal_sentence_encoder package (used as the TensorFlow Hub cache directory), and a final folder named with the SHA1 hash of the TFHub model URL (c9fe785512ca4a1b179831acb18a0c6bfba603dd for en_use_lg).
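For reference, TensorFlow Hub names its cache folders after the SHA1 hash of the model handle, so you can compute the last part of the path yourself (a small sketch; the printed value should match the folder name used above):

import hashlib

url = 'https://tfhub.dev/google/universal-sentence-encoder-large/5'
# TFHub cache folders are named sha1(model URL)
print(hashlib.sha1(url.encode('utf8')).hexdigest())
# expected: c9fe785512ca4a1b179831acb18a0c6bfba603dd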

For a less dirty solution, I should modify the part of the code where I set the folder for the TensorFlow models: https://github.com/MartinoMensio/spacy-universal-sentence-encoder/blob/master/spacy_universal_sentence_encoder/language.py#L71 On its own, TensorFlow Hub uses the environment variable TFHUB_CACHE_DIR to store models and avoid re-downloading them. I should enable forwarding of this environment variable (instead of just setting it) with the next release of this library.
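The change would look roughly like this (a sketch, not the actual library code; the path below stands in for the folder the library currently hardcodes):

import os

# respect a user-provided cache location instead of always overriding it
default_models_path = '/usr/local/lib/python3.7/site-packages/spacy_universal_sentence_encoder/models'
os.environ.setdefault('TFHUB_CACHE_DIR', default_models_path)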

For @tanghaoyu258: if you have already downloaded the model and have the same issue with Docker, this workaround should work for you as well. You can copy/symlink the downloaded model to the spacy_universal_sentence_encoder/models/c9fe785512ca4a1b179831acb18a0c6bfba603dd folder inside the site-packages (or dist-packages) directory of your Python environment.
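For example, with a symlink (hypothetical paths; adjust them to your environment):

import os

# link the pre-downloaded model into the folder where the library expects its TFHub cache
src = '/Users/foo/Downloads/universal-sentence-encoder-large_5'
dst = '/usr/local/lib/python3.7/site-packages/spacy_universal_sentence_encoder/models/c9fe785512ca4a1b179831acb18a0c6bfba603dd'
os.symlink(src, dst)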

Otherwise, if you want a pre-packaged model that does not depend on the network, that is a different issue which I will try to address. In that case, I could provide two versions of each model release: a "slim" version which downloads the TFHub files later, and a "full" version which already contains the models directory. I hope this solves your problems.

Best, Martino

jrruethe commented 3 years ago

Thank you for the response! I wasn't aware of where the model is stored or how the sha1 piece of the path was formed, so that was very helpful. I am pretty sure your suggestion of mounting the model into the container will work perfectly for my needs.

I'll try it out shortly and let you know how it goes. Thanks again!

jrruethe commented 3 years ago

I can confirm that this works for my use-case. But I wanted to add some notes for you or the next person to come along.

I am using debian:buster-slim as my Docker base image. I found that I had to use the following volume mapping for it to work:

docker run -it --rm -v `pwd`/models/universal_sentence_encoder/c9fe785512ca4a1b179831acb18a0c6bfba603dd:/usr/local/lib/python3.7/dist-packages/spacy_universal_sentence_encoder/models/c9fe785512ca4a1b179831acb18a0c6bfba603dd --entrypoint /bin/bash nlp

The path was slightly different, but I figured it out. This allows me to load the model just fine.

Some other caveats I noticed: I am using spaCy 2.3.2, and I found that when I install your version 0.2.3 using the following, everything works fine:

pip install git+https://github.com/MartinoMensio/spacy-universal-sentence-encoder-tfhub.git@961d1e55dca93db03699f538838d74cf9a351130

However, when I install your version 0.3.1 using this command:

pip install spacy_universal_sentence_encoder-0.3.1.tar.gz

Then when I attempt to load the library, I get the following error:

----> 1 nlp = spacy.load("en_use_lg")

/usr/local/lib/python3.7/dist-packages/spacy/__init__.py in load(name, **overrides)
     28     if depr_path not in (True, False, None):
     29         warnings.warn(Warnings.W001.format(path=depr_path), DeprecationWarning)
---> 30     return util.load_model(name, **overrides)
     31 
     32 

/usr/local/lib/python3.7/dist-packages/spacy/util.py in load_model(name, **overrides)
    168             return load_model_from_link(name, **overrides)
    169         if is_package(name):  # installed as package
--> 170             return load_model_from_package(name, **overrides)
    171         if Path(name).exists():  # path to model data directory
    172             return load_model_from_path(Path(name), **overrides)

/usr/local/lib/python3.7/dist-packages/spacy/util.py in load_model_from_package(name, **overrides)
    189     """Load a model from an installed package."""
    190     cls = importlib.import_module(name)
--> 191     return cls.load(**overrides)
    192 
    193 

/usr/local/lib/python3.7/dist-packages/en_use_lg/__init__.py in load(**overrides)
     10 
     11 def load(**overrides):
---> 12     return load_model_from_init_py(__file__, **overrides)

/usr/local/lib/python3.7/dist-packages/spacy/util.py in load_model_from_init_py(init_file, **overrides)
    237     if not model_path.exists():
    238         raise IOError(Errors.E052.format(path=path2str(data_path)))
--> 239     return load_model_from_path(data_path, meta, **overrides)
    240 
    241 

/usr/local/lib/python3.7/dist-packages/spacy/util.py in load_model_from_path(model_path, meta, **overrides)
    218             config.update(overrides)
    219             factory = factories.get(name, name)
--> 220             component = nlp.create_pipe(factory, config=config)
    221             nlp.add_pipe(component, name=name)
    222     return nlp.from_disk(model_path, exclude=disable)

/usr/local/lib/python3.7/dist-packages/spacy/language.py in create_pipe(self, name, config)
    308                 raise KeyError(Errors.E108.format(name=name))
    309             else:
--> 310                 raise KeyError(Errors.E002.format(name=name))
    311         factory = self.factories[name]
    312         return factory(self, **config)

KeyError: "[E002] Can't find factory for 'save_tfhub_model_url'. This usually happens when spaCy calls `nlp.create_pipe` with a component name that's not built in - for example, when constructing the pipeline from a model's meta.json. If you're using a custom component, you can write to `Language.factories['save_tfhub_model_url']` or remove it from the model meta and add it via `nlp.add_pipe` instead."

I wanted to let you know, in case this is unexpected. For now, my original issue is solved and I am unblocked, so feel free to close this.

Thanks again! I appreciate the help.

MartinoMensio commented 3 years ago

Thanks @jrruethe for finding the path of the library installed on your docker container.

For your issue with the factory for save_tfhub_model_url: it is a consequence of the installed en_use_lg model not matching the installed spacy_universal_sentence_encoder library, i.e. the two are at different versions.

The dependency check is done when installing the standalone models (which you installed first), and you then probably updated spacy_universal_sentence_encoder without updating the standalone model.

To check the installed versions you can run:

python -c "import spacy_universal_sentence_encoder; print(spacy_universal_sentence_encoder.__version__)"
python -c "import en_use_lg; print(en_use_lg.__version__)"

Using the same version number should solve your problem.
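For example, to align both at v0.3.1 using the artifacts already mentioned in this thread:

pip install spacy_universal_sentence_encoder-0.3.1.tar.gz https://github.com/MartinoMensio/spacy-universal-sentence-encoder/releases/download/v0.3.1/en_use_lg-0.3.1.tar.gz#en_use_lg-0.3.1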

Martino

MartinoMensio commented 3 years ago

I just released v0.3.2, which makes use of the TFHUB_CACHE_DIR environment variable. I included instructions in README.md under "Using a pre-downloaded model".

For the Docker case, you can modify the docker run command in the following way if you want to be less dependent on the OS (e.g., the site-packages vs dist-packages path):

docker run -it --rm -v `pwd`/models/universal_sentence_encoder:/SOME_SIMPLE_PATH_HERE -e TFHUB_CACHE_DIR=/SOME_SIMPLE_PATH_HERE --entrypoint /bin/bash nlp

The command maps the volume to a custom directory, whose path is also passed as an environment variable so that TensorFlow Hub can locate it. SOME_SIMPLE_PATH_HERE can be any path; it no longer needs to be the Python libraries folder. At the same time, your current volume mapping still works with the newer version.
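You can also set the variable from Python, as long as it happens before the model is first loaded (a sketch, assuming the model files already sit in /SOME_SIMPLE_PATH_HERE):

import os

# TensorFlow Hub reads TFHUB_CACHE_DIR when resolving a model, so set it before loading
os.environ['TFHUB_CACHE_DIR'] = '/SOME_SIMPLE_PATH_HERE'

import spacy
nlp = spacy.load('en_use_lg')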

Martino

jrruethe commented 3 years ago

Ha, you are exactly right about my mismatched versions! I had intended to update both, but I think my image was docker-cached during the build.

Thanks again for the info, this is extremely helpful!

clippered commented 3 years ago

Hi @MartinoMensio, thanks for writing such great code. I know this issue is already closed, but I have an issue related to this one. I tried to include the downloaded model in the Docker container, then run the Docker image from AWS Lambda. Unfortunately, AWS Lambda has a read-only filesystem except for the mounted /tmp folder, which has 500 MB. The model was downloaded to /opt/tfhub_cache, and the error I'm getting when running the container is:

[ERROR] PermissionDeniedError: /opt/tfhub_cache/c9fe785512ca4a1b179831acb18a0c6bfba603dd.lock.tmp03dfa58baff947469914432fa2f56606; Read-only file system

After some investigation, tensorflow_hub.load() always creates a lock file even when the model already exists. Is there any way to work around this? Thanks.

MartinoMensio commented 3 years ago

Hi @clippered , I have done some investigation related to your issue (keep in mind that I am not using AWS Lambda and therefore have no experience with it). From what I understood, you have two locations: /opt/tfhub_cache, which contains the pre-downloaded model but is read-only, and /tmp, which is writable but limited to 500 MB.

I tried reproducing the issue on my machine by removing write access to the folder where I have the models, but tensorflow_hub.load() still completes successfully, so I have not found a way to reproduce the issue locally.

Can you try updating to the latest version of TensorFlow (2.4.0), which is supported by this library as of version 0.4.0 (just released), and see if the issue persists?

I found that there is something called "file system access for Lambda functions" (https://docs.aws.amazon.com/lambda/latest/dg/configuration-filesystem.html), so that page may be related to your issue. If you manage to mount a write-enabled file system (for example at /mnt/my_big_space), you can set the TFHUB_CACHE_DIR environment variable to point to it and it should work.

Martino

clippered commented 3 years ago

Thanks for your response @MartinoMensio. Unfortunately, I still got the same Permission denied error with this latest version. I think this is how tensorflow_hub works by design. It might not really be your issue.

Attaching EFS to AWS Lambda could be another option, like you suggested. However, I would prefer to bake everything into the container's image instead, so as not to use any more AWS resources.

It is weird that you cannot reproduce it. Another way to reproduce it might be: download the model as one user, make sure the location it is stored in is read-only for all other users, then run the Python script that loads the model as a different user.

Anyway, don't worry too much about this and thanks for your help.

clippered commented 3 years ago

Just got the traceback for reference.

  File "/var/task/entry.py", line 13, in <module>
    NLP = spacy.load("en_use_lg")
  File "/var/lang/lib/python3.8/site-packages/spacy/__init__.py", line 47, in load
    return util.load_model(name, disable=disable, exclude=exclude, config=config)
  File "/var/lang/lib/python3.8/site-packages/spacy/util.py", line 322, in load_model
    return load_model_from_package(name, **kwargs)
  File "/var/lang/lib/python3.8/site-packages/spacy/util.py", line 355, in load_model_from_package
    return cls.load(vocab=vocab, disable=disable, exclude=exclude, config=config)
  File "/var/lang/lib/python3.8/site-packages/en_use_lg/__init__.py", line 10, in load
    return load_model_from_init_py(__file__, **overrides)
  File "/var/lang/lib/python3.8/site-packages/spacy/util.py", line 514, in load_model_from_init_py
    return load_model_from_path(
  File "/var/lang/lib/python3.8/site-packages/spacy/util.py", line 389, in load_model_from_path
    nlp = load_model_from_config(config, vocab=vocab, disable=disable, exclude=exclude)
  File "/var/lang/lib/python3.8/site-packages/spacy/util.py", line 426, in load_model_from_config
    nlp = lang_cls.from_config(
  File "/var/lang/lib/python3.8/site-packages/spacy/language.py", line 1650, in from_config
    nlp.add_pipe(
  File "/var/lang/lib/python3.8/site-packages/spacy/language.py", line 767, in add_pipe
    pipe_component = self.create_pipe(
  File "/var/lang/lib/python3.8/site-packages/spacy/language.py", line 658, in create_pipe
    resolved = registry.resolve(cfg, validate=validate)
  File "/var/lang/lib/python3.8/site-packages/thinc/config.py", line 721, in resolve
    resolved, _ = cls._make(
  File "/var/lang/lib/python3.8/site-packages/thinc/config.py", line 770, in _make
    filled, _, resolved = cls._fill(
  File "/var/lang/lib/python3.8/site-packages/thinc/config.py", line 842, in _fill
    getter_result = getter(*args, **kwargs)
  File "/var/lang/lib/python3.8/site-packages/spacy_universal_sentence_encoder/language.py", line 89, in use_model_factory
    model = UniversalSentenceEncoder(model_url, enable_cache, debug)
  File "/var/lang/lib/python3.8/site-packages/spacy_universal_sentence_encoder/language.py", line 104, in __init__
    _ = UniversalSentenceEncoder.get_model(self.model_url, self.enable_cache, self.debug)
  File "/var/lang/lib/python3.8/site-packages/spacy_universal_sentence_encoder/language.py", line 133, in get_model
    model = TFHubWrapper(use_model_url, enable_cache=enable_cache, debug=debug)
  File "/var/lang/lib/python3.8/site-packages/spacy_universal_sentence_encoder/language.py", line 177, in __init__
    self.model = hub.load(self.model_url)
  File "/var/lang/lib/python3.8/site-packages/tensorflow_hub/module_v2.py", line 92, in load
    module_path = resolve(handle)
  File "/var/lang/lib/python3.8/site-packages/tensorflow_hub/module_v2.py", line 47, in resolve
    return registry.resolver(handle)
  File "/var/lang/lib/python3.8/site-packages/tensorflow_hub/registry.py", line 51, in __call__
    return impl(*args, **kwargs)
  File "/var/lang/lib/python3.8/site-packages/tensorflow_hub/compressed_module_resolver.py", line 67, in __call__
    return resolver.atomic_download(handle, download, module_dir,
  File "/var/lang/lib/python3.8/site-packages/tensorflow_hub/resolver.py", line 370, in atomic_download
    tf_utils.atomic_write_string_to_file(lock_file, lock_contents,
  File "/var/lang/lib/python3.8/site-packages/tensorflow_hub/tf_utils.py", line 65, in atomic_write_string_to_file
    f.write(contents)
  File "/var/lang/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 102, in write
    self._prewrite_check()
  File "/var/lang/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 87, in _prewrite_check
    self._writable_file = _pywrap_file_io.WritableFile(