ghnp5 closed this issue 3 months ago
Spark NLP downloads, extracts, and saves all models and pipelines in the default location, the ~/cache_pretrained
directory. It never re-downloads anything: it prints that it is going to download, but then immediately reports it is done and starts loading, because the model is already there.
The only issue arises if your Docker container doesn't persist these models, in which case there is nothing to load next time. But rest assured: the models are very large, and the logic is to check the cache first, then download.
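If the container's filesystem is not persisted, one way to keep the cache across runs is to mount a Docker volume at the cache path. A minimal sketch (the image name, port, and in-container path below are illustrative assumptions, not details from this issue):

```shell
# Create a named volume and mount it where Spark NLP keeps its cache,
# so pretrained models survive container restarts.
docker volume create spark_nlp_cache

docker run -d \
  -v spark_nlp_cache:/root/cache_pretrained \
  -p 8080:8080 \
  my-sparknlp-image
```

With this mount in place, the first run downloads the models into the volume, and every later container start finds them already cached.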
Is there an existing issue for this?
Who can help?
No response
What are you working on?
I have set up a Docker image that prepares everything, and a
server.py
which is very similar to the Quick Start program, but it also exposes a REST API. I'm using:
spark-nlp==5.3.2 pyspark==3.3.1
Current Behavior
The jars and modules seem to be downloaded only once. The second time, everything loads much faster.
However, it appears that the model is always being re-downloaded:
Expected Behavior
I would expect the model to be cached somewhere, so the next time it will just "find" it.
Steps To Reproduce
Please let me know if you need any extra code, or if the above info is enough.
Spark NLP version and Apache Spark
5.3.2
Type of Spark Application
No response
Java Version
OpenJDK 11
Java Home Directory
/usr/lib/jvm/java-11-openjdk-amd64/
Setup and installation
flask/waitress, docker
Operating System and Version
ubuntu:latest
Link to your project (if available)
No response
Additional Information
No response