DS4SD / docling

Get your documents ready for gen AI
https://ds4sd.github.io/docling
MIT License

Unable to run. #262

Closed ashunaveed closed 2 weeks ago

ashunaveed commented 3 weeks ago

Bug

PS C:\Users\genco> & C:/ProgramData/anaconda3/envs/docling/python.exe c:/Users/genco/OneDrive/Documents/marker_new/docling_convertor_testing.py
Fetching 9 files: 100%|██████████| 9/9 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "c:\Users\genco\OneDrive\Documents\marker_new\docling_convertor_testing.py", line 5, in <module>
    result = converter.convert(source)
  File "C:\ProgramData\anaconda3\envs\docling\lib\site-packages\pydantic\validate_call_decorator.py", line 60, in wrapper_function
    return validate_call_wrapper(*args, **kwargs)
  File "C:\ProgramData\anaconda3\envs\docling\lib\site-packages\pydantic\_internal\_validate_call.py", line 96, in __call__
    res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
  File "C:\ProgramData\anaconda3\envs\docling\lib\site-packages\docling\document_converter.py", line 161, in convert
    return next(all_res)
  File "C:\ProgramData\anaconda3\envs\docling\lib\site-packages\docling\document_converter.py", line 180, in convert_all
    for conv_res in conv_res_iter:
  File "C:\ProgramData\anaconda3\envs\docling\lib\site-packages\docling\document_converter.py", line 211, in _convert
    for item in map(
  File "C:\ProgramData\anaconda3\envs\docling\lib\site-packages\docling\document_converter.py", line 255, in _process_document
    conv_res = self._execute_pipeline(in_doc, raises_on_error=raises_on_error)
  File "C:\ProgramData\anaconda3\envs\docling\lib\site-packages\docling\document_converter.py", line 263, in _execute_pipeline
    pipeline = self._get_pipeline(in_doc.format)
  File "C:\ProgramData\anaconda3\envs\docling\lib\site-packages\docling\document_converter.py", line 244, in _get_pipeline
    self.initialized_pipelines[pipeline_class] = pipeline_class(
  File "C:\ProgramData\anaconda3\envs\docling\lib\site-packages\docling\pipeline\standard_pdf_pipeline.py", line 54, in __init__
    self.glm_model = GlmModel(options=GlmOptions())
  File "C:\ProgramData\anaconda3\envs\docling\lib\site-packages\docling\models\ds_glm_model.py", line 46, in __init__
    load_pretrained_nlp_models()
  File "C:\ProgramData\anaconda3\envs\docling\lib\site-packages\deepsearch_glm\utils\load_pretrained_models.py", line 120, in load_pretrained_nlp_models
    done, data = download_items(downloads)
  File "C:\ProgramData\anaconda3\envs\docling\lib\site-packages\deepsearch_glm\utils\load_pretrained_models.py", line 50, in download_items
    with target.open("wb") as fw:
  File "C:\ProgramData\anaconda3\envs\docling\lib\pathlib.py", line 1119, in open
    return self._accessor.open(self, mode, buffering, encoding, errors,
PermissionError: [Errno 13] Permission denied: 'C:\ProgramData\anaconda3\envs\docling\lib\site-packages\deepsearch_glm\resources\models\crf\part-of-speech\crf_pos_model_en.bin'
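The last frame fails while opening a file for writing inside site-packages, so one quick diagnostic is to check whether the current user can create files in that directory at all. The following is a minimal sketch using only the standard library; `can_write` is a hypothetical helper name, not part of docling or deepsearch_glm, and the path is simply copied from the traceback.

```python
import tempfile
from pathlib import Path


def can_write(directory: Path) -> bool:
    """Return True if a temporary file can be created inside `directory`."""
    try:
        # The temporary file is removed automatically when the block exits.
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers PermissionError as well as a missing directory.
        return False


if __name__ == "__main__":
    # Directory taken from the traceback; adjust it for your environment.
    target = Path(
        r"C:\ProgramData\anaconda3\envs\docling\lib\site-packages"
        r"\deepsearch_glm\resources\models\crf\part-of-speech"
    )
    print(target, "writable:", can_write(target))
```

If this prints `writable: False`, the download step cannot succeed from that account regardless of the docling version, which would be consistent with the PermissionError shown.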

Steps to reproduce

Run this code:

from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # PDF path or URL
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())  # output: "### Docling Technical Report[...]"

Docling version

latest version.

Python version

3.10.15

PeterStaar-IBM commented 3 weeks ago

@ashunaveed Can you please tell us the exact version? There should be no need to download crf_pos_model_en.bin.

Please run:

docling --version

We suspect that you happen to have an older version installed, but we want to be 100% sure.
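If the `docling` command is not on the PATH, the same information can be read from the active Python environment. This is a minimal sketch using the standard library's `importlib.metadata`; the distribution names in `PACKAGES` are assumed to match the names on PyPI, and `installed_version` is a hypothetical helper, not a docling API.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed PyPI distribution names; extend the tuple if needed.
PACKAGES = (
    "docling",
    "docling-core",
    "docling-ibm-models",
    "docling-parse",
    "deepsearch-glm",
)


def installed_version(name: str) -> str:
    """Return the installed version of a distribution, or 'not installed'."""
    try:
        return version(name)
    except PackageNotFoundError:
        return "not installed"


if __name__ == "__main__":
    for name in PACKAGES:
        print(f"{name}: {installed_version(name)}")
```

Running it with the same interpreter that runs the failing script (here, the one from the `docling` conda environment) avoids reporting versions from a different environment.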

erikmargaronis commented 2 weeks ago

I'm trying to run Docling on a server without an internet connection, so I downloaded the layout model and TableFormer from Hugging Face. I then tried to run with a custom artifacts path, as per your documentation:

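from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# artifacts_path: local directory holding the downloaded layout and TableFormer models
pipeline_options = PdfPipelineOptions(artifacts_path=artifacts_path)
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)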

But I get an error similar to the OP's (though for me the problem is a timeout due to a connection error).

I have tried with these versions:

Docling version: 2.5.1
Docling Core version: 2.3.2
Docling IBM Models version: 2.0.3
Docling Parse version: 2.0.3

and an older version:

Docling version: 2.3.1
Docling Core version: 2.3.1
Docling IBM Models version: 2.0.3
Docling Parse version: 2.0.2

It tries to download the GLM files with both versions.

I'm mostly curious to understand whether the GLM files are needed, as your answer above indicates that at least crf_pos_model_en.bin shouldn't be needed at all.

dolfim-ibm commented 2 weeks ago

I think we found the issue, see PR https://github.com/DS4SD/docling/pull/322.

erikmargaronis commented 2 weeks ago

Wonderful! This seems to fix the issue! Thanks for the quick response! :)