DS4SD / docling

Get your docs ready for gen AI
https://ds4sd.github.io/docling
MIT License

Conversion error? #165

Open maurogatti opened 5 days ago

maurogatti commented 5 days ago

Three documents were obtained from web pages using Firefox's File -> Print -> Save as PDF command (on macOS). The documents were subsequently converted into JSON files with this snippet of code:

    pdf_files = os.listdir(utd_directory)
    pipeline_options = PipelineOptions(do_table_structure=True)
    pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE  # use more accurate TableFormer model
    converter = DocumentConverter(
        pipeline_options=pipeline_options,
    )

Conversion was successful despite the following warnings:

    /Users/maurogatti/anaconda3/envs/rag_10/lib/python3.10/site-packages/easyocr/detection.py:85: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
      net.load_state_dict(copyStateDict(torch.load(trained_model, map_location=device)))
    /Users/maurogatti/anaconda3/envs/rag_10/lib/python3.10/site-packages/easyocr/recognition.py:182: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
      model.load_state_dict(torch.load(model_path, map_location=device))
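
For readers who only want to quiet these two warnings while the conversion runs, a minimal sketch using the standard library `warnings` module follows. It assumes the warnings are attributed to modules whose names start with `easyocr`, as the file paths in the messages suggest. The helper name `silence_easyocr_future_warnings` is hypothetical and not part of docling or EasyOCR.

```python
import warnings


def silence_easyocr_future_warnings() -> None:
    """Ignore FutureWarning raised from modules named easyocr or easyocr.*.

    Assumption: the torch.load warnings above are attributed to
    easyocr.detection and easyocr.recognition, as their file paths suggest.
    """
    warnings.filterwarnings(
        "ignore",
        category=FutureWarning,
        module=r"easyocr(\.|$)",  # regex matched against the start of the module name
    )


# Call once, before the converter loads the OCR models.
silence_easyocr_future_warnings()
```

This only hides the messages; it does not change how the model weights are loaded.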

However, while loading the converted documents with LangChain using this snippet of code:

    loader = PyPDFDirectoryLoader(utd_directory)
    langchain_utd_docs = loader.load_and_split()

I obtain the following messages:

    Ignoring wrong pointing object 41 0 (offset 0)
    Ignoring wrong pointing object 513 0 (offset 0)
    Ignoring wrong pointing object 1264 0 (offset 0)
    Ignoring wrong pointing object 74 0 (offset 0)
    Ignoring wrong pointing object 393 0 (offset 0)
    Ignoring wrong pointing object 399 0 (offset 0)
    Ignoring wrong pointing object 431 0 (offset 0)
    Ignoring wrong pointing object 86 0 (offset 0)
    Ignoring wrong pointing object 502 0 (offset 0)
    Ignoring wrong pointing object 542 0 (offset 0)

It is unclear whether this is a docling or a LangChain problem, but it raises doubts about the correct conversion of documents generated with File -> Print -> Save as PDF.
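
One way to narrow down which library emits these messages is to capture log records by logger name while each step runs. The sketch below uses only the standard library `logging` module. It assumes the messages go through Python logging under a logger whose name starts with `pypdf` (the PDF parser behind `PyPDFDirectoryLoader`), which should be verified against the installed version. The helper `capture_warnings_from` is hypothetical.

```python
import logging
from contextlib import contextmanager


class _ListHandler(logging.Handler):
    """Collect WARNING-and-above records in a list instead of printing them."""

    def __init__(self) -> None:
        super().__init__(level=logging.WARNING)
        self.records: list[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


@contextmanager
def capture_warnings_from(logger_name: str):
    """Yield a list that fills with records logged under logger_name or its children."""
    logger = logging.getLogger(logger_name)
    handler = _ListHandler()
    logger.addHandler(handler)
    try:
        yield handler.records
    finally:
        logger.removeHandler(handler)


# Usage sketch (logger name "pypdf" is an assumption):
# with capture_warnings_from("pypdf") as records:
#     langchain_utd_docs = loader.load_and_split()
# print(len(records), "warnings came from pypdf")
```

If the list fills up during `load_and_split()` and stays empty during the docling conversion, the messages come from the PDF parsing step rather than from docling.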

PeterStaar-IBM commented 5 days ago

@maurogatti thanks, we will look into it!