deepjavalibrary / djl

An Engine-Agnostic Deep Learning Framework in Java
https://djl.ai
Apache License 2.0

HuggingFace model multilingual-e5-small fails to open on Windows due to max path limitations #3048

Closed david-sitsky closed 7 months ago

david-sitsky commented 7 months ago

Description

While I work on Linux, I have to write software that also works on Windows. The multilingual-e5-small model has a large number of properties, one for each language it supports. You can see this when deploying the model on Linux:

DEBUG DefaultModelZoo Scanning models in repo: class ai.djl.repository.RemoteRepository, djl://ai.djl.huggingface.pytorch/intfloat/multilingual-e5-small?translatorFactory=ai.djl.translate.NoopServingTranslatorFactory
INFO  ModelInfo Loading model multilingual_e5_small on cpu()
DEBUG ModelZoo Loading model with Criteria:
    Application: UNDEFINED
    Input: class ai.djl.modality.Input
    Output: class ai.djl.modality.Output
    Engine: PyTorch
    ModelZoo: ai.djl.localmodelzoo
    Arguments: {"padding":"true","engine":"PyTorch","translatorFactory":"ai.djl.huggingface.translator.TextEmbeddingTranslatorFactory"}
    Options: {"modelName":"multilingual-e5-small","mapLocation":"true"}
    No translator supplied

DEBUG ModelZoo Searching model in specified model zoo: ai.djl.localmodelzoo
DEBUG ModelZoo Checking ModelLoader: ai.djl.huggingface.pytorch:intfloat/multilingual-e5-small NLP.TEXT_EMBEDDING [
    ai.djl.huggingface.pytorch/intfloat/multilingual-e5-small/0.0.1/multilingual-e5-small {"multilingual":"true","af":"true","am":"true","ar":"true","as":"true","az":"true","be":"true","bg":"true","bn":"true","br":"true","bs":"true","ca":"true","cs":"true","cy":"true","da":"true","de":"true","el":"true","en":"true","eo":"true","es":"true","et":"true","eu":"true","fa":"true","fi":"true","fr":"true","fy":"true","ga":"true","gd":"true","gl":"true","gu":"true","ha":"true","he":"true","hi":"true","hr":"true","hu":"true","hy":"true","id":"true","is":"true","it":"true","ja":"true","jv":"true","ka":"true","kk":"true","km":"true","kn":"true","ko":"true","ku":"true","ky":"true","la":"true","lo":"true","lt":"true","lv":"true","mg":"true","mk":"true","ml":"true","mn":"true","mr":"true","ms":"true","my":"true","ne":"true","nl":"true","no":"true","om":"true","or":"true","pa":"true","pl":"true","ps":"true","pt":"true","ro":"true","ru":"true","sa":"true","sd":"true","si":"true","sk":"true","sl":"true","so":"true","sq":"true","sr":"true","su":"true","sv":"true","sw":"true","ta":"true","te":"true","th":"true","tl":"true","tr":"true","ug":"true","uk":"true","ur":"true","uz":"true","vi":"true","xh":"true","yi":"true","zh":"true"}
]
DEBUG MRL Preparing artifact: djl://ai.djl.huggingface.pytorch/intfloat/multilingual-e5-small?translatorFactory=ai.djl.translate.NoopServingTranslatorFactory, ai.djl.huggingface.pytorch/intfloat/multilingual-e5-small/0.0.1/multilingual-e5-small {"multilingual":"true","af":"true","am":"true","ar":"true","as":"true","az":"true","be":"true","bg":"true","bn":"true","br":"true","bs":"true","ca":"true","cs":"true","cy":"true","da":"true","de":"true","el":"true","en":"true","eo":"true","es":"true","et":"true","eu":"true","fa":"true","fi":"true","fr":"true","fy":"true","ga":"true","gd":"true","gl":"true","gu":"true","ha":"true","he":"true","hi":"true","hr":"true","hu":"true","hy":"true","id":"true","is":"true","it":"true","ja":"true","jv":"true","ka":"true","kk":"true","km":"true","kn":"true","ko":"true","ku":"true","ky":"true","la":"true","lo":"true","lt":"true","lv":"true","mg":"true","mk":"true","ml":"true","mn":"true","mr":"true","ms":"true","my":"true","ne":"true","nl":"true","no":"true","om":"true","or":"true","pa":"true","pl":"true","ps":"true","pt":"true","ro":"true","ru":"true","sa":"true","sd":"true","si":"true","sk":"true","sl":"true","so":"true","sq":"true","sr":"true","su":"true","sv":"true","sw":"true","ta":"true","te":"true","th":"true","tl":"true","tr":"true","ug":"true","uk":"true","ur":"true","uz":"true","vi":"true","xh":"true","yi":"true","zh":"true"}
DEBUG AbstractRepository Files have been downloaded already: /data/djl-serving/cache/cache/repo/model/nlp/text_embedding/ai/djl/huggingface/pytorch/intfloat/multilingual-e5-small/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/true/0.0.1

Expected Behavior

That the model can be loaded on Windows.

Error Message

ai.djl.engine.EngineException: open file failed because of errno 2 on fopen: No such file or directory, file path: C:\Users\sits\.djl.ai\cache\repo\model\nlp\text_embedding\ai\djl\huggingface\pytorch\intfloat\multilingual-e5-small\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\true\0.0.1\multilingual-e5-small.pt
    at ai.djl.pytorch.jni.PyTorchLibrary.moduleLoad(Native Method) ~[pytorch-engine-0.26.0.jar:?]
    at ai.djl.pytorch.jni.JniUtils.loadModule(JniUtils.java:1742) ~[pytorch-engine-0.26.0.jar:?]
    at ai.djl.pytorch.engine.PtModel.load(PtModel.java:93) ~[pytorch-engine-0.26.0.jar:?]
    at ai.djl.repository.zoo.BaseModelLoader.loadModel(BaseModelLoader.java:166) ~[api-0.26.0.jar:?]
    at ai.djl.repository.zoo.Criteria.loadModel(Criteria.java:172) ~[api-0.26.0.jar:?]

How to Reproduce?

Open the above model on a Windows machine.

Thoughts

Why are model properties represented as sub-directories? This seems an expensive way to do it, when a properties file would use fewer filesystem resources. Other filesystems also have path limits that this representation is likely to hit.

Can this be easily changed, or is this a more involved change? I'm happy to have a look if some guidance can be provided.
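
To illustrate the path-length concern, here is a minimal sketch. It assumes the cache layout visible in the logs above (one directory level per property value) and the traditional Windows MAX_PATH limit of 260 characters. The class and method names are hypothetical, and this is not DJL's actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: mimics the cache layout seen in the logs, not DJL's implementation. */
final class CachePathLength {

    /** Traditional Windows MAX_PATH limit, in characters. */
    static final int WINDOWS_MAX_PATH = 260;

    /** Appends one directory level per property value, then the version and file name. */
    static String buildPath(
            String base, Map<String, String> properties, String version, String file) {
        StringBuilder sb = new StringBuilder(base);
        for (String value : properties.values()) {
            sb.append('\\').append(value);
        }
        return sb.append('\\').append(version).append('\\').append(file).toString();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the model's long list of boolean properties.
        Map<String, String> props = new LinkedHashMap<>();
        for (int i = 0; i < 90; i++) {
            props.put("p" + i, "true");
        }
        String base = "C:\\Users\\sits\\.djl.ai\\cache\\repo\\model\\nlp\\text_embedding"
                + "\\ai\\djl\\huggingface\\pytorch\\intfloat\\multilingual-e5-small";
        String path = buildPath(base, props, "0.0.1", "multilingual-e5-small.pt");
        System.out.println(path.length() + " chars, limit " + WINDOWS_MAX_PATH);
    }
}
```

Each "true" segment adds five characters, so the property levels alone exceed the limit once there are more than about fifty of them.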

zachgk commented 7 months ago

For a workaround, you can modify the metadata.json and remove the properties. You may also need to change file references from relative to absolute. Loading this would avoid the long path problems.

In terms of the larger usage of properties inside model paths, the main reason we do so is to distinguish between artifacts within the same metadata. Properties make it work with a fairly clean directory tree for most cases. This also applies only to the local cache of downloaded models and datasets. The code for it is in Artifact.getResourceUri() if you are interested.

For a complete solution, we would need to replace it with a new system. The obvious one is to use the artifact name instead of the properties. This requires that all artifacts have names and that they are unique within a metadata.json file. I am not sure off the top of my head that this is true, so we would have to verify. Assuming that is fine, it should be a fairly easy change. @frankfliu what do you think?
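
The verification step could look roughly like the sketch below. It assumes the artifact names from one metadata.json have already been extracted into a list (the JSON parsing is left out), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: checks whether artifact names within one metadata file are unique. */
final class ArtifactNameCheck {

    /** Returns every name that occurs more than once, each reported a single time. */
    static List<String> findDuplicates(List<String> names) {
        Set<String> seen = new HashSet<>();
        List<String> duplicates = new ArrayList<>();
        for (String name : names) {
            // Set.add returns false when the name was already present.
            if (!seen.add(name) && !duplicates.contains(name)) {
                duplicates.add(name);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Hypothetical names standing in for the "name" fields of a metadata.json.
        List<String> names = List.of("model-a", "model-b", "model-a");
        System.out.println("Duplicate artifact names: " + findDuplicates(names));
    }
}
```

An empty result would mean the artifact name alone is enough to keep cached artifacts apart.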

frankfliu commented 7 months ago

The properties are generated by the model importing tool; we need to update the tool to combine the languages into a single property.

We can manually change the metadata.json for now.
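
A rough sketch of what combining the languages could mean. The "languages" property name, the comma-separated format, and the class itself are assumptions for illustration, not what the importing tool actually does.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.StringJoiner;

/** Illustrative only: collapses per-language boolean flags into one property. */
final class LanguagePropertyMerger {

    /**
     * Copies all properties, replacing entries such as "af":"true" with a single
     * "languages" entry (hypothetical name) that holds the codes joined by commas.
     */
    static Map<String, String> merge(Map<String, String> properties, Set<String> languageCodes) {
        Map<String, String> merged = new LinkedHashMap<>();
        StringJoiner languages = new StringJoiner(",");
        for (Map.Entry<String, String> entry : properties.entrySet()) {
            if (languageCodes.contains(entry.getKey()) && "true".equals(entry.getValue())) {
                languages.add(entry.getKey());
            } else {
                merged.put(entry.getKey(), entry.getValue());
            }
        }
        if (languages.length() > 0) {
            merged.put("languages", languages.toString());
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("multilingual", "true");
        props.put("af", "true");
        props.put("am", "true");
        System.out.println(merge(props, Set.of("af", "am")));
    }
}
```

With a layout that adds one directory per property, this would reduce the path to two property levels regardless of how many languages the model supports.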

frankfliu commented 7 months ago

I scanned the existing model zoo: for all models (except mxnet yolo), the artifact name + version is unique in metadata.json.

We don't actually need to use properties in the file path to avoid path clashes.
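
As a sketch of one possible layout under that observation, the cache directory could be derived from the artifact name and version alone. This is an assumption for illustration and not necessarily how the fix is implemented; the class and method names are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Illustrative only: a cache layout keyed by artifact name and version. */
final class NameVersionPath {

    /** Resolves cacheRoot/artifactName/version, independent of the property count. */
    static Path resolve(Path cacheRoot, String artifactName, String version) {
        return cacheRoot.resolve(artifactName).resolve(version);
    }

    public static void main(String[] args) {
        Path dir = resolve(Paths.get("cache"), "multilingual-e5-small", "0.0.1");
        System.out.println(dir + " (" + dir.getNameCount() + " path elements)");
    }
}
```

The directory depth stays constant no matter how many properties an artifact declares.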

david-sitsky commented 7 months ago

@frankfliu - many thanks for fixing this. I can confirm on Windows this now works as expected using 0.27.0-SNAPSHOT.