DS4SD / docling

Get your documents ready for gen AI
https://ds4sd.github.io/docling
MIT License
8.63k stars, 409 forks

Complete text in rows #231

Open pankpy opened 1 week ago

pankpy commented 1 week ago

Thank you for the initiative. I am using docling for table extraction, and it returns tables as DataFrames as expected. However, in some rows the cell text is incomplete, or the text is split across multiple lines. Is there any way to fix this?

cau-git commented 1 week ago

@pankpy Could you please provide an example to illustrate the behaviour? Thanks.

pankpy commented 1 week ago

Thank you. Please find the attached files and the script I am running.

from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import (
    DocumentConverter,
    PdfFormatOption,
    WordFormatOption,
)
from docling.pipeline.simple_pipeline import SimplePipeline

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False  # Not using scanned documents
pipeline_options.do_table_structure = True

doc_converter = DocumentConverter(
    # all of the below is optional, has internal defaults.
    allowed_formats=[
        InputFormat.PDF,
        InputFormat.IMAGE,
        InputFormat.DOCX,
        InputFormat.HTML,
        InputFormat.PPTX,
    ],  # whitelist formats, non-matching files are ignored.
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,  # pipeline options go here.
            backend=PyPdfiumDocumentBackend,  # optional: pick an alternative backend
        ),
        InputFormat.DOCX: WordFormatOption(
            pipeline_cls=SimplePipeline  # default for office formats and HTML
        ),
    },
)

###############

# Raw string so the backslashes in the Windows path are not treated as escapes.
result = doc_converter.convert(r"E:\zPankaj\Sample.pdf")  # previously convert_single
print(result.document.export_to_markdown())

print('VERIFY RESULT', result.document)
print('RESULT TYPE', type(result.document))

# Export every detected table to a DataFrame and save it as an Excel file.
for i, table in enumerate(result.document.tables):
    df = table.export_to_dataframe()
    print(df)
    df.to_excel(f'Output Sample_S df_{i}.xlsx')

Attachments: Sample.pdf, Output Sample_S df_0.xlsx, Output Sample_S df_1.xlsx, Pycharm_prints
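For the "text in multiple lines" symptom, a possible stopgap is to normalize whitespace in each exported DataFrame before writing it to Excel. The sketch below uses only pandas and the standard library. It assumes the split text shows up as line breaks embedded inside a single cell, which may not match what the attached files show. `collapse_cell_whitespace` and `clean_table` are hypothetical helper names, not part of docling, and this does not recover text that is missing from a cell.

```python
import re

import pandas as pd


def collapse_cell_whitespace(value):
    """Collapse runs of whitespace (including line breaks) in a string cell."""
    if isinstance(value, str):
        return re.sub(r"\s+", " ", value).strip()
    return value  # leave numbers, NaN, and other non-string values untouched


def clean_table(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new DataFrame with whitespace normalized in every cell."""
    # Series.map applies the helper element by element, column by column.
    return df.apply(lambda column: column.map(collapse_cell_whitespace))


if __name__ == "__main__":
    # Made-up sample data standing in for one exported table.
    sample = pd.DataFrame({"Item": ["Widget\nlarge", "Bolt"], "Qty": [2, 5]})
    print(clean_table(sample))
```

In the loop above, this would be applied as `clean_table(df)` right after `export_to_dataframe()`. If the text is instead split across separate rows rather than inside one cell, this helper will not change anything.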