Open pankpy opened 1 week ago
@pankpy Could you please provide an example to illustrate the behaviour? Thanks.
Thank you. Please find the attached files.
from docling.datamodel.base_models import InputFormat
from docling.document_converter import (
    DocumentConverter,
    PdfFormatOption,
    WordFormatOption,
)
from docling.pipeline.simple_pipeline import SimplePipeline
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False  # Not using scanned documents
pipeline_options.do_table_structure = True

doc_converter = DocumentConverter(
    # All of the below is optional and has internal defaults.
    allowed_formats=[
        InputFormat.PDF,
        InputFormat.IMAGE,
        InputFormat.DOCX,
        InputFormat.HTML,
        InputFormat.PPTX,
    ],  # whitelist formats, non-matching files are ignored.
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,  # pipeline options go here.
            backend=PyPdfiumDocumentBackend,  # optional: pick an alternative backend
        ),
        InputFormat.DOCX: WordFormatOption(
            pipeline_cls=SimplePipeline  # default for office formats and HTML
        ),
    },
)

###############

# Raw string so the backslashes in the Windows path are not read as escapes.
conv_result = doc_converter.convert(r"E:\zPankaj\Sample.pdf")  # previously convert_single

print(conv_result.document.export_to_markdown())

print('VERIFY RESULT', conv_result.document)
print('RESULT TYPE', type(conv_result.document))

# Export every detected table to a DataFrame and save it as an Excel file.
for i, table in enumerate(conv_result.document.tables):
    df = table.export_to_dataframe()
    print(df)
    df.to_excel(f'Output Sample_S df_{i}.xlsx')

Attached files:
Sample.pdf
Output Sample_S df_0.xlsx
Output Sample_S df_1.xlsx
Thank you for the initiative. I am using it for table extraction and it returns tables/dataframes as expected. However, in some rows the cell text is incomplete, and in others it is split across multiple lines. Is there any way to fix this?
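For the multi-line part only, one possible workaround is to post-process each DataFrame after `export_to_dataframe()`. The sketch below is not part of docling. `normalize_cell` and `collapse_cell_whitespace` are hypothetical helper names, and the approach assumes the split text shows up as newlines or repeated spaces inside string cells. It cannot restore text that the extraction step dropped.

```python
def normalize_cell(value):
    """Collapse runs of whitespace (spaces, tabs, newlines) in a string to one space.

    Non-string values (numbers, None, NaN) are returned unchanged.
    """
    if isinstance(value, str):
        return " ".join(value.split())
    return value


def collapse_cell_whitespace(df):
    """Return a new DataFrame with normalize_cell applied to every cell and header.

    Hypothetical helper: `df` is expected to be a pandas DataFrame, such as
    the one returned by table.export_to_dataframe() in the script above.
    """
    # apply() walks the columns by position, so duplicate column names are fine.
    cleaned = df.apply(lambda column: column.map(normalize_cell))
    cleaned.columns = [normalize_cell(name) for name in cleaned.columns]
    return cleaned
```

In the export loop this would be called as `df = collapse_cell_whitespace(table.export_to_dataframe())` before `df.to_excel(...)`. Rows where the text is actually truncated would need to be looked at on the extraction side rather than in post-processing.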