JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

It seems the model is downloaded every time the program starts - any way to cache? #14223

Closed · ghnp5 closed this issue 3 months ago

ghnp5 commented 3 months ago

Is there an existing issue for this?

Who can help?

No response

What are you working on?

I have set up a Docker image that prepares everything, and a server.py which is very similar to the Quick Start program, but it creates a REST API as well.

I'm using spark-nlp==5.3.2 and pyspark==3.3.1, with:

spark = sparknlp.start()
language_detector_pipeline = PretrainedPipeline('detect_language_43', lang='xx')

Current Behavior

The jars and modules seem to be downloaded only once. The second time, everything loads much faster.

However, it appears that the model is always being re-downloaded:

 0 artifacts copied, 78 already retrieved (0kB/20ms)
24/04/02 23:36:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
detect_language_43 download started this may take some time.
Approx size to download 8.1 MB
[ / ]detect_language_43 download started this may take some time.
Approximate size to download 8.1 MB
[ — ]Download done! Loading the resource.

Expected Behavior

I would expect the model to be cached somewhere, so the next time it will just "find" it.

Steps To Reproduce

Please let me know if you need any extra code, or if the above info is enough.

Spark NLP version and Apache Spark

5.3.2

Type of Spark Application

No response

Java Version

OpenJDK 11

Java Home Directory

/usr/lib/jvm/java-11-openjdk-amd64/

Setup and installation

flask/waitress, docker

Operating System and Version

ubuntu:latest

Link to your project (if available)

No response

Additional Information

No response

maziyarpanahi commented 3 months ago

Spark NLP downloads, extracts, and saves all models and pipelines to the default location, the ~/cache_pretrained directory. It never re-downloads anything that is already there: the log says it is going to download, but it then immediately reports the download is done and starts loading the resource, because the files already exist locally.

The only issue would be if your Docker container doesn't persist these models, so that on the next run there is no model to load. But rest assured, we work with very large models and the logic is to check the cache first and download only if needed.
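Since each fresh container starts with an empty home directory, one way to keep the cache across restarts is to mount a host volume at the default cache path. A minimal sketch, where the image name, host path, and port are placeholders for your own setup (and /root assumes the container runs as root):

```shell
# Persist Spark NLP's default cache directory (~/cache_pretrained)
# across container runs by backing it with a host directory.
# "my-sparknlp-api" and the host path are placeholders.
docker run -d \
  -p 5000:5000 \
  -v "$HOME/sparknlp_cache:/root/cache_pretrained" \
  my-sparknlp-api
```

Alternatively, if you prefer a path of your own choosing inside the container, sparknlp.start() accepts a cache_folder argument you can point at the mounted volume instead.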