DS4SD / docling

Get your documents ready for gen AI
https://ds4sd.github.io/docling
MIT License

analyzing the pdf is too slow #398

Closed langzichai closed 1 day ago

langzichai commented 1 day ago

Question

I just need to extract the content of a PDF, but analyzing the file is too slow: a 63 MB file took more than 14 minutes. Is there a way to improve the speed? Also, can you confirm whether the GPU is used by default? I saw no GPU utilization during the run.

GPU: V100 32 GB

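Please refer to the code:

    import logging
    from datetime import datetime

    from docling.datamodel.base_models import DocumentStream, InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    logger = logging.getLogger(__name__)
    start_time = datetime.now()

    # file_stream is a binary stream (e.g. BytesIO) holding the uploaded PDF,
    # created elsewhere in the application.
    docs = [DocumentStream(name="uploaded_file.pdf", stream=file_stream)]

    # OCR enabled (GPU requested); table structure recognition disabled.
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.ocr_options.use_gpu = True
    pipeline_options.do_table_structure = False
    pipeline_options.table_structure_options.do_cell_matching = False

    doc_converter = DocumentConverter(
        allowed_formats=[InputFormat.PDF],
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        },
    )

    # Convert and process the document
    conv_results = doc_converter.convert_all(docs)
    for res in conv_results:
        logger.info(f"Document {res.input.file.name} converted.")
        content = res.document.export_to_text()
        end_time = datetime.now()
        logger.info(f"Conversion ended at {end_time}")
        logger.info(f"Total processing time: {end_time - start_time}")

dolfim-ibm commented 1 day ago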

You can read about our optimization and performance benchmarks here:

PeterStaar-IBM commented 1 day ago

This will all be addressed in the new "Docling Technical Report v2", including an optimization for GPUs. I will close this for now; please follow up in the discussion.
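On the question of whether the GPU is actually being used, a minimal sketch along the following lines may help narrow things down before the report lands. It assumes the OCR backend runs on PyTorch (as EasyOCR does), and the helper names `cuda_status` and `timed` are hypothetical, not part of docling. It only reports whether PyTorch can see a CUDA device and how long a wrapped step takes; it does not show which docling stage is the bottleneck.

```python
import importlib.util
import time
from contextlib import contextmanager


def cuda_status():
    """Describe whether PyTorch is installed and can see a CUDA device.

    Hypothetical helper: not a docling API.
    """
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch  # imported lazily so the sketch runs without torch

    if torch.cuda.is_available():
        return f"CUDA available: {torch.cuda.get_device_name(0)}"
    return "CUDA not available (running on CPU)"


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the wrapped block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


if __name__ == "__main__":
    print(cuda_status())

    timings = {}
    with timed("conversion", timings):
        # Placeholder for the real work, e.g. iterating over
        # doc_converter.convert_all(docs) from the snippet in the question.
        time.sleep(0.01)
    print(f"conversion took {timings['conversion']:.3f} s")
```

If `cuda_status()` reports that CUDA is not available, the `use_gpu = True` setting cannot take effect regardless of the pipeline options, which would be consistent with seeing no GPU utilization.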