Cache mechanism for metadata.json in S3 is not working

mehmetbutgul commented 3 months ago

Is there an existing issue for this?

[X] I have searched the existing issues and did not find a match.

Who can help?

@danilojsl

What are you working on?

I am working on developing code.

Current Behavior

The cache mechanism for metadata.json in S3 is not working. Every time the pretrained() method is invoked, metadata.json is downloaded again. For example, to download a model of roughly 3 MB, the source code also downloads a ~10 MB metadata.json, and it does so for every model whenever pretrained() is used. The underlying problem is that a cache exists, but it is never updated, so it never produces a hit.

Expected Behavior

Once metadata.json has been downloaded, it should be kept in the cache, and the cache should be checked on every download. The cache lifetime also matters; it could perhaps be increased to 10 minutes.

Steps To Reproduce

Reproducing this requires stepping through the code in a debugger.

Call pretrained() on several annotators in sequence: AAA.pretrained(), BBB.pretrained(), ...

The cache variable is never updated in the source code:
private val repoFolder2Metadata: mutable.Map[String, RepositoryMetadata]
https://github.com/JohnSnowLabs/spark-nlp/blob/6b181a6ff77925144acc41c140f44449001b7083/src/main/scala/com/johnsnowlabs/nlp/pretrained/S3ResourceDownloader.scala#L40

As a result, the following condition is always false:
if (!needToRefresh) { // The condition is always false !!!
https://github.com/JohnSnowLabs/spark-nlp/blob/6b181a6ff77925144acc41c140f44449001b7083/src/main/scala/com/johnsnowlabs/nlp/pretrained/S3ResourceDownloader.scala#L57

Any fix also needs to handle the different folder possibilities, such as public/models and clinical/models.
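
To illustrate the failure mode, here is a minimal sketch of a per-folder cache with a time-to-live. It is not the actual S3ResourceDownloader code: CachedMetadata, FolderMetadataCache, and the injected download and now functions are hypothetical names, used so the example runs without S3. The point is the write-back after a download; if that step is missing, every lookup misses and the file is fetched again.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the real RepositoryMetadata; only the fields
// needed for this sketch are modelled.
final case class CachedMetadata(entries: Seq[String], fetchedAtMillis: Long)

// Per-folder cache with a time-to-live. `download` and `now` are injected so
// the sketch runs without S3; both are assumptions, not Spark NLP APIs.
class FolderMetadataCache(
    ttlMillis: Long,
    download: String => Seq[String],
    now: () => Long = () => System.currentTimeMillis()) {

  // Keyed by folder, e.g. "public/models" or "clinical/models".
  private val repoFolder2Metadata = mutable.Map.empty[String, CachedMetadata]

  def get(folder: String): Seq[String] = {
    val current = now()
    repoFolder2Metadata.get(folder) match {
      // Cache hit: the entry for this folder is still fresh.
      case Some(cached) if current - cached.fetchedAtMillis < ttlMillis =>
        cached.entries
      // Miss or stale entry: download, then write the result back.
      // Skipping this write-back is what makes every call re-download.
      case _ =>
        val entries = download(folder)
        repoFolder2Metadata(folder) = CachedMetadata(entries, current)
        entries
    }
  }
}
```

With a 10-minute lifetime (ttlMillis = 600000), repeated pretrained()-style lookups against the same folder within that window would trigger a single download, while each distinct folder keeps its own entry.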

Spark NLP version and Apache Spark

sparknlp==5.3.1 / 5.3.2. The Apache Spark version does not matter for reproducing this.

Type of Spark Application

No response

Java Version

Java 8

Java Home Directory

No response

Setup and installation

No response

Operating System and Version

No response

Link to your project (if available)

No response

Additional Information

No response

maziyarpanahi commented 3 months ago

Do you have a PR that proposes a caching mechanism? It has to be:

- time based (expires after a duration)
- session based (if the session dies and we are in a new session, we MUST download a new metadata.json)
- controllable through a force_download Boolean that lets users override the behavior, either enabling caching or disabling it to keep the current default. (Only the code suggests that we cache metadata.json; we have never mentioned this in our docs or anywhere else. As far as anybody knows, if we make a change in metadata.json you'll see it immediately!)

That said, the very best solution is to check metadata.json: if it was updated, we MUST download it. If the file hasn't changed, we shall skip it in that session/application cycle.
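
The change-detection idea can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: ChangeAwareMetadataCache, VersionedMetadata, remoteVersion, and download are hypothetical names, and the version marker is assumed to be something cheap to fetch, such as an ETag or a last-modified timestamp. The cache lives in memory only, so a new session starts empty, and forceDownload lets the caller bypass it.

```scala
// Hypothetical version marker for the remote file, e.g. an ETag or a
// last-modified timestamp rendered as a string.
final case class VersionedMetadata(version: String, body: String)

// `remoteVersion` is a cheap check; `download` fetches the full file.
// Both are injected placeholders, not Spark NLP or AWS SDK calls.
class ChangeAwareMetadataCache(
    remoteVersion: () => String,
    download: () => String) {

  // Lives only as long as this object, so a new session starts empty.
  private var cached: Option[VersionedMetadata] = None

  def get(forceDownload: Boolean = false): String = {
    val latest = remoteVersion()
    cached match {
      // Unchanged remote file and no override: reuse the cached body.
      case Some(m) if !forceDownload && m.version == latest => m.body
      // First call, changed file, or forced refresh: download and store.
      case _ =>
        val body = download()
        cached = Some(VersionedMetadata(latest, body))
        body
    }
  }
}
```

The trade-off in this design is one small version check per lookup in exchange for skipping the ~10 MB download whenever the file is unchanged.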

mehmetbutgul commented 3 months ago

Hi @maziyarpanahi, thanks for your comments. Following them, I opened a PR for this issue that implements your last idea: check metadata.json, download it if it was updated, and skip it for that session/application cycle if it hasn't changed. I agree with you that this seems to be the best solution. PR --> https://github.com/JohnSnowLabs/spark-nlp/pull/14224

maziyarpanahi commented 3 months ago

Many thanks @mehmetbutgul - I've left it to Danilo to review, and I'll make sure to include it in tomorrow's release. Thanks again for your contribution. 🚀
