Three documents were obtained from web pages using Firefox's File -> Print -> Save as PDF command (on macOS). The documents were subsequently converted into JSON files with this snippet of code:
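import os

# Import paths assumed from the docling package layout
from docling.datamodel.pipeline_options import PipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter

pdf_files = os.listdir(utd_directory)
pipeline_options = PipelineOptions(do_table_structure=True)
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE  # use more accurate TableFormer model
converter = DocumentConverter(
    pipeline_options=pipeline_options,
)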
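The snippet above lists every entry in `utd_directory`, which on macOS can include hidden files such as `.DS_Store`. A minimal sketch of a stricter listing follows, assuming the directory holds the saved PDFs directly; the helper name `list_pdf_files` is hypothetical and not part of docling.

```python
from pathlib import Path


def list_pdf_files(directory):
    """Return the PDF files found directly inside `directory`.

    Hidden files (for example .DS_Store on macOS), non-PDF files and
    subdirectories are skipped. The extension check is case-insensitive.
    """
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file()
        and path.suffix.lower() == ".pdf"
        and not path.name.startswith(".")
    )
```

Passing the resulting paths to the converter, instead of the raw `os.listdir` output, rules out non-PDF inputs as a cause of later problems.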
Conversion was successful despite the following warnings:
/Users/maurogatti/anaconda3/envs/rag_10/lib/python3.10/site-packages/easyocr/detection.py:85: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  net.load_state_dict(copyStateDict(torch.load(trained_model, map_location=device)))
/Users/maurogatti/anaconda3/envs/rag_10/lib/python3.10/site-packages/easyocr/recognition.py:182: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  model.load_state_dict(torch.load(model_path, map_location=device))
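These are `FutureWarning` notices raised by `torch.load` calls inside EasyOCR. If they are too noisy, one option is to filter them with the standard `warnings` module before running the conversion. This is only a sketch, not a docling feature: the helper name is hypothetical, and the filter assumes the warnings are attributed to modules under `easyocr`, as the paths in the log suggest.

```python
import warnings


def silence_easyocr_torch_load_warnings():
    """Ignore the torch.load FutureWarning when it is raised from easyocr.

    The message and module patterns are regular expressions matched from
    the start of the warning text and of the module name respectively.
    """
    warnings.filterwarnings(
        "ignore",
        message=r"You are using `torch\.load` with `weights_only=False`",
        category=FutureWarning,
        module=r"easyocr\.",
    )
```

Calling the helper once, before the converter is created, leaves every other warning visible.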
However, when I load the converted documents with LangChain with this snippet of code, I obtain the following messages:
It is unclear whether this is a Docling or a LangChain problem, but it raises doubts about whether documents generated with File -> Print -> Save as PDF are converted correctly.
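One way to narrow this down is to check, independently of LangChain, that each converted file is at least well-formed JSON. The sketch below uses only the standard library; the function name `check_json_files` is hypothetical, and it assumes the converted files sit in one directory with a `.json` extension. It checks syntax only, not whether the content matches the structure the loader expects.

```python
import json
from pathlib import Path


def check_json_files(directory):
    """Try to parse every *.json file found directly inside `directory`.

    Returns a dict mapping each file name to None when parsing succeeds,
    or to the error message when it fails.
    """
    results = {}
    for path in sorted(Path(directory).glob("*.json")):
        try:
            with path.open(encoding="utf-8") as handle:
                json.load(handle)
            results[path.name] = None
        except (json.JSONDecodeError, UnicodeDecodeError) as error:
            results[path.name] = str(error)
    return results
```

If every file maps to `None`, the files are syntactically valid, which points toward the loading step or the document structure rather than the JSON serialization itself.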