Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.45k stars · 581 forks

feat/Use local model for hi_res partition #2631

Open AntoninLeroy opened 3 months ago

AntoninLeroy commented 3 months ago

Hello,

Maybe this feature already exists, but I didn't manage to find it. I work on a network that blocks Hugging Face, and I would like to run:

elements = partition_pdf(filename=PDF_PATH, strategy='hi_res', infer_table_structure=True)

But the function cannot run because it's trying to access the yolox model on the hub:

SSLError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /unstructuredio/yolo_x_layout/resolve/main/yolox_l0.05.onnx (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)')))"), '(Request ID: 757ef56e-88d9-4a7a-88ef-ff3fade2139c)')

My question is: if I manage to download the model to my machine somehow, how can I use it with the unstructured library without it making the HTTPS request?

I hope my explanation makes sense.

Thanks in advance.
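One possible avenue, assuming the model files can first be copied into the local Hugging Face cache (e.g. downloaded on a machine with access and transferred over): huggingface_hub honours the `HF_HUB_OFFLINE` environment variable, which makes it serve cached files without opening any network connection. A hedged sketch; `PDF_PATH` and the commented-out calls mirror the snippet above:

```python
import os

# Assumes ~/.cache/huggingface/hub was pre-populated on a machine with hub
# access and copied to this machine. Setting offline mode BEFORE importing
# unstructured should make huggingface_hub resolve files from the local
# cache instead of contacting huggingface.co.
os.environ["HF_HUB_OFFLINE"] = "1"

# from unstructured.partition.pdf import partition_pdf  # import after setting the flag
# elements = partition_pdf(filename=PDF_PATH, strategy='hi_res', infer_table_structure=True)

print(os.environ["HF_HUB_OFFLINE"])
```

Whether the cached revision matches what the library requests still needs verifying; this only avoids the network call when the cache already holds the file.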

sanjaydokula commented 3 months ago

Sorry, I don't have an answer, but I'd like to ask: is it trying to download the model files, or to run inference elsewhere? I also saw a page in the unstructured docs that might help: https://unstructured-io.github.io/unstructured/installation/full_installation.html#setting-up-unstructured-for-local-inference

AntoninLeroy commented 3 months ago

Yes, it's trying to download the model from Hugging Face; that's the default behaviour of the unstructured library.

peixin-lin commented 3 months ago

I have encountered the same issue. I explored the source code a little and found these lines in site-packages\unstructured_inference\models\yolox.py (lines 34-59):

MODEL_TYPES = {
    "yolox": LazyDict(
        model_path=LazyEvaluateInfo(
            hf_hub_download,
            "unstructuredio/yolo_x_layout",
            "yolox_l0.05.onnx",
        ),
        label_map=YOLOX_LABEL_MAP,
    ),
    "yolox_tiny": LazyDict(
        model_path=LazyEvaluateInfo(
            hf_hub_download,
            "unstructuredio/yolo_x_layout",
            "yolox_tiny.onnx",
        ),
        label_map=YOLOX_LABEL_MAP,
    ),
    "yolox_quantized": LazyDict(
        model_path=LazyEvaluateInfo(
            hf_hub_download,
            "unstructuredio/yolo_x_layout",
            "yolox_l0.05_quantized.onnx",
        ),
        label_map=YOLOX_LABEL_MAP,
    ),
}

So I guess the Hugging Face models are hard-coded to download from the hub at the moment. Could we have an option to point the models at a pre-downloaded path in a later version? Thanks!
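For readers unfamiliar with the pattern in the snippet above: `LazyEvaluateInfo` stores a callable (here `hf_hub_download`) plus its arguments and only invokes it when the value is first needed, which is the moment the hub request fires. A minimal stdlib illustration of that mechanism (the class and `fake_download` below are simplified stand-ins, not the library's actual implementation):

```python
# Simplified illustration of deferred evaluation: nothing is "downloaded"
# until evaluate() is called, mirroring when hf_hub_download would run.
class LazyEvaluateInfo:
    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def evaluate(self):
        return self.func(*self.args)

calls = []

def fake_download(repo, filename):
    calls.append((repo, filename))
    return f"/cache/{repo}/{filename}"

info = LazyEvaluateInfo(fake_download, "unstructuredio/yolo_x_layout", "yolox_l0.05.onnx")
assert calls == []            # constructing the dict triggers no download
path = info.evaluate()        # the request only happens here
print(path)
```

This is why simply importing the module does not hit the network, but running a `hi_res` partition does.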

kexuedaishu commented 2 months ago

I've met the same problem and would also like local model support to be added.

ericfeunekes commented 2 months ago

Not sure if it will help, but have you tried specifying a custom model (undocumented)? https://github.com/Unstructured-IO/unstructured/pull/2462

scanny commented 1 month ago

I haven't dug into all the details of HuggingFace caching, but this page from their website seems like an excellent resource: https://huggingface.co/docs/huggingface_hub/en/guides/manage-cache

I expect some sort of "download separately and manually install into the cache" solution is possible. It's probably not something we'll get to in the short term, but if someone is willing to work out how to do that, we'd be very interested. At a minimum we'd have a working solution here that folks can find via search, and possibly add to the documentation.
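The cache layout that page documents is predictable, which is what makes a manual install feasible: each repo gets a `models--{org}--{name}` folder with a `snapshots/{commit_hash}` subfolder holding the files. A minimal stdlib sketch of that naming scheme (`abc123` is a placeholder; the real commit hash is what the repo's `refs/main` file records):

```python
from pathlib import PurePosixPath

def hf_cache_path(cache_dir: str, repo_id: str, commit_hash: str, filename: str) -> str:
    """Mirror the documented huggingface_hub cache layout:
    <cache>/models--{org}--{name}/snapshots/{commit_hash}/{filename}"""
    repo_folder = "models--" + repo_id.replace("/", "--")
    return str(PurePosixPath(cache_dir, repo_folder, "snapshots", commit_hash, filename))

print(hf_cache_path("~/.cache/huggingface/hub",
                    "unstructuredio/yolo_x_layout",
                    "abc123",
                    "yolox_l0.05.onnx"))
```

Placing a manually downloaded file at that path (plus the matching `refs/` entry) is the sort of "manual install into cache" approach being described; the exact metadata files the hub client expects should be checked against the linked docs.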

AntoninLeroy commented 1 month ago

Hey, I haven't made any progress on this unfortunately; I'm lost in the module code... There's probably a modification to make in the get_model() function in the unstructured-inference dependency.

The best implementation, to me, would be to simply pass a new argument, "hi_res_model_path", to the partition_pdf function: elements = partition_pdf(filename=PDF_PATH, strategy='hi_res', hi_res_model_path='/path/to/model', infer_table_structure=True)

Is anyone able to evaluate the amount of work needed to develop this?
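The proposed behaviour can be sketched as a small resolver: prefer a supplied local path when the file exists, otherwise fall back to the hub download. Everything below is hypothetical illustration, not the library's API; the stub lambda stands in for the real `hf_hub_download` call:

```python
import os

def resolve_model_path(hi_res_model_path=None,
                       download=lambda: "/hub/cache/yolox_l0.05.onnx"):
    """Prefer an existing local file; otherwise fall back to the hub download.

    A sketch of the hi_res_model_path idea proposed above; `download` is a
    stub standing in for the real hf_hub_download call."""
    if hi_res_model_path and os.path.exists(hi_res_model_path):
        return hi_res_model_path
    return download()

print(resolve_model_path())  # no local file supplied -> stub "download"
```

Threading such an argument from partition_pdf down into unstructured-inference's get_model() is presumably where most of the work would be.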

Vincewz commented 1 month ago

I encountered the same problem too

mmaryam2020 commented 1 month ago

I finally made it work locally with these changes. Depending on your settings, you need to give a local path as one of the inputs. You can download the models once and save them, or download them directly and pass their path.

1) env/lib/python3.9/site-packages/unstructured_inference/models/tables.py

logger.info("Loading the table structure model ...")
model_path = 'path to table_transformer_recognition'
self.model = TableTransformerForObjectDetection.from_pretrained(model_path, use_pretrained_backbone=False)
self.model.eval()
# self.model.save_pretrained("path to save model")  # run this once while connected to the hub to save the model locally, then comment it out again

In case you are using yolox in your settings, you also need to change: 2) env/lib/python3.9/site-packages/unstructured_inference/models/yolox.py

MODEL_TYPES = {  
    "yolox": LazyDict(  
        model_path='path to yolox_l0.05.onnx folder downloaded from hugging face',  
        label_map=YOLOX_LABEL_MAP,  
    ),

Hope it can solve your issue

MthwRobinson commented 1 month ago

Thanks all, we'll plan to add better support for this.

zzw1123 commented 1 month ago

Same problem. Any progress there?

FennFlyer commented 1 month ago

> I finally made it work locally with these changes. [...]

Thank you for posting, this worked for me as well!