JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

It seems the model is downloaded every time the program starts - any way to cache? #14223

Closed · ghnp5 closed this issue 3 months ago

ghnp5 commented 3 months ago

Is there an existing issue for this?

Who can help?

No response

What are you working on?

I have set up a Docker image that prepares everything, and a server.py which is very similar to the Quick Start program, but it creates a REST API as well.

I'm using spark-nlp==5.3.2 and pyspark==3.3.1, with:

spark = sparknlp.start()
language_detector_pipeline = PretrainedPipeline('detect_language_43', lang='xx')

Current Behavior

The jars and modules seem to be downloaded only once. The second time, everything loads much faster.

However, it appears that the model is always being re-downloaded:

 0 artifacts copied, 78 already retrieved (0kB/20ms)
24/04/02 23:36:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
detect_language_43 download started this may take some time.
Approx size to download 8.1 MB
[ / ]detect_language_43 download started this may take some time.
Approximate size to download 8.1 MB
[ — ]Download done! Loading the resource.

Expected Behavior

I would expect the model to be cached somewhere, so the next time it will just "find" it.

Steps To Reproduce

Please let me know if you need any extra code, or if the above info is enough.

Spark NLP version and Apache Spark

5.3.2

Type of Spark Application

No response

Java Version

OpenJDK 11

Java Home Directory

/usr/lib/jvm/java-11-openjdk-amd64/

Setup and installation

flask/waitress, docker

Operating System and Version

ubuntu:latest

Link to your project (if available)

No response

Additional Information

No response

maziyarpanahi commented 3 months ago

Spark NLP downloads, extracts, and saves all models and pipelines to the default location, the ~/cache_pretrained directory. It never re-downloads anything that is already there: the log says it is going to download, but it then immediately reports the download is done and starts loading the resource, because the files already exist locally.

The only issue would be if your Docker container doesn't persist these models, so that on the next run there is no model to load. But rest assured, we work with very large models and the logic is to check the cache first and download only if needed.
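Since each fresh container starts with an empty home directory, one way to keep the cache across restarts is to mount a host volume at the default cache path. A minimal sketch, where the image name, host path, and port are placeholders for your own setup (and /root assumes the container runs as root):

```shell
# Persist Spark NLP's default cache directory (~/cache_pretrained)
# across container runs by backing it with a host directory.
# "my-sparknlp-api" and the host path are placeholders.
docker run -d \
  -p 5000:5000 \
  -v "$HOME/sparknlp_cache:/root/cache_pretrained" \
  my-sparknlp-api
```

Alternatively, if you prefer a path of your own choosing inside the container, sparknlp.start() accepts a cache_folder argument you can point at the mounted volume instead.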