DS4SD / docling

Get your documents ready for gen AI
https://ds4sd.github.io/docling

docling vs GROBID #74

Open sdspieg opened 2 months ago

sdspieg commented 2 months ago

Issue: Comparing GROBID and Docling for Parsing Scholarly Publications

My Use Case

We need to parse and extract all relevant information from thousands of scholarly publications: metadata, full text, titles, references, tables, and more. For this purpose, we've been using GROBID, an open-source library for extracting structured information from scientific and scholarly documents, especially PDFs. GROBID was used to build the Semantic Scholar Open Research Corpus (S2ORC), a massive dataset of scholarly articles maintained by the Allen Institute for AI (AI2), and it has proven robust in handling different document layouts and complexities - for us as well!

Recently I came across Docling, so I decided to evaluate its performance and output quality to understand its potential advantages and limitations compared to GROBID.

Experiment Setup

To compare GROBID and Docling fairly, I conducted an experiment using the same 10 randomly selected scholarly PDFs, each larger than 30 KB. Here is a summary of the setup:

  1. Environment Setup:

    • Both GROBID and Docling were installed locally, ensuring they could utilize available GPU resources (NVIDIA CUDA).
    • GROBID was run in a Docker container, while Docling was installed in a Python virtual environment with GPU acceleration enabled (I THINK!).
  2. Data Selection:

    • Ten random PDFs larger than 30 KB were selected from a dataset of scholarly publications, representing various lengths and complexities, focusing on documents with rich metadata, references, and tables.
  3. Conversion Process:

    • GROBID: The /api/processFulltextDocument endpoint was used to extract metadata, full text, references, and other elements. Results were saved in XML format, and performance metrics, such as processing time per document, were recorded.
    • Docling: Its Python API was used with OCR and table structure detection enabled. Docling processed each PDF, exporting the results in JSON, Markdown, and other formats. Performance metrics were also captured. (Minimal sketches of both conversion calls follow this list.)
  4. Output and Comparison:

    • Two separate output directories were created: one for GROBID's results (converted_output_grobid) and one for Docling's results (converted_output_docling).
    • A consolidated JSON file was generated to compare both sets of results, including processing time and any errors encountered.
    • The results will also be evaluated by a frontier large language model (LLM) to assess the quality of the parsing, providing a more nuanced understanding of how each library handles different elements (e.g., metadata, references, tables).
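For reference, here is a minimal sketch of the two conversion calls. It is illustrative rather than the exact notebook code: the PDF path and the local GROBID URL (default port 8070) are assumptions, and the Docling snippet uses the current (v2) Python API with OCR and table structure enabled.

```python
import time

import requests

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pdf_path = "paper.pdf"  # illustrative; the benchmark loops over ten PDFs

# --- GROBID: POST the PDF to the full-text endpoint of a local server ---
t0 = time.time()
with open(pdf_path, "rb") as f:
    resp = requests.post(
        "http://localhost:8070/api/processFulltextDocument",
        files={"input": f},
    )
tei_xml = resp.text  # GROBID returns TEI XML
grobid_seconds = time.time() - t0

# --- Docling: convert with OCR and table-structure detection enabled ---
pipeline_options = PdfPipelineOptions(do_ocr=True, do_table_structure=True)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
t0 = time.time()
result = converter.convert(pdf_path)
docling_seconds = time.time() - t0

markdown = result.document.export_to_markdown()  # Markdown export
as_dict = result.document.export_to_dict()       # dict for the JSON export
```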

Questions and Observations

My early experiments show that GROBID is significantly faster than Docling. I suspect Docling uses the GPU less effectively, which may explain the (MUCH) longer processing times.
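As a sanity check on that suspicion, one can at least confirm that PyTorch (which Docling's models run on) sees the CUDA device - this only shows the GPU is visible, not that Docling uses it efficiently:

```python
import torch

print(torch.cuda.is_available())          # True if the CUDA runtime is usable
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # the NVIDIA device PyTorch would use
```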

Given this, I have the following questions:

  1. Is My Setup Fair and Correct?

    • Are GROBID and Docling configured correctly for a fair comparison?
    • Are there any changes or optimizations that I should consider to ensure both libraries run optimally?
  2. Can I Optimize GPU Usage for Docling?

    • Docling seems to be using my GPU less efficiently than GROBID. Are there any parameters or configurations that could improve its performance?
  3. Does Docling Focus on Different Use Cases?

    • It seems that Docling may be optimized for use cases other than academic document parsing, such as enterprise data extraction. I would appreciate it if the developers could clarify:
      • Should Docling be expected to perform well on scholarly documents?
      • Are there any specific settings or adjustments I should make to give Docling a more equitable opportunity in this comparison?

Additional Context

Thank you for your insights!

sdspieg commented 2 months ago

Insights from the GROBID vs. Docling Performance Comparison (so far)

The following visualizations provide insights into the performance differences between GROBID and Docling for processing scholarly PDFs. The comparison is based on the processing time for ten randomly selected PDFs, each larger than 30 KB.

1. Total Processing Time for GROBID vs. Docling

[Chart: total processing time (seconds) for all ten PDFs, GROBID vs. Docling]

This bar chart shows the total processing time (in seconds) for all ten PDFs combined. GROBID is much faster than Docling, with a significantly lower total processing time. This suggests that GROBID is better optimized for the task of extracting information from scholarly PDFs, utilizing available resources more efficiently.

2. Processing Time per Document (Linear Scale)

[Chart: processing time (seconds) per document, linear scale]

This bar chart compares the processing times (in seconds) for each individual PDF on a linear scale. The results highlight that GROBID is consistently faster across all documents. However, due to the large disparity in processing times, especially for some of the larger files, Docling's bars are much taller, making GROBID's bars appear almost negligible in some cases.

3. Processing Time per Document (Logarithmic Scale)

[Chart: processing time (seconds) per document, logarithmic scale]

To provide a clearer view of the differences in performance, especially where the gap is substantial, we use a logarithmic scale in this chart. The logarithmic scale compresses the large differences and makes the variations in processing time more visible across all documents. Even on this scale, GROBID demonstrates consistently lower processing times, while Docling’s processing times are significantly higher, particularly for more complex or longer documents.

Summary

Overall, GROBID proves to be much faster and more efficient in processing scholarly PDFs compared to Docling. The total processing time chart illustrates a stark difference, while the document-by-document comparisons confirm GROBID's superior performance across a variety of document sizes and complexities. The logarithmic chart further demonstrates the dramatic performance gap, making it easier to visualize how much longer Docling takes for certain documents. This analysis suggests that while GROBID is well-optimized for scholarly documents, Docling may require further adjustments or optimizations to improve its performance in similar use cases.

And BTW - here's my Jupyter Notebook

PeterStaar-IBM commented 2 months ago

@sdspieg This is a great start, thank you! First things first, I think we need to look at two things:

  1. the output produced by both packages on the test set (could you specify the exact PDFs you used in the benchmark?). For example, we know the table model will consume most of the time, and it might be that GROBID recovers no table structure (or only a very approximate one) - see the sketch after this list.
  2. the set-up: currently, docling uses a batched approach (single thread, maximum number of documents) rather than optimizing for time-to-solution (maximum number of threads per document). This is why time per document is probably not ideal; we should look at time per batch of documents.
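To isolate the table model's cost (point 1), a quick experiment is to re-run the Docling conversion with table-structure recovery disabled, or with TableFormer in its faster mode, and compare the timings. A sketch, assuming the v2 pipeline options:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption

# Variant A: skip table-structure recovery entirely
opts = PdfPipelineOptions(do_ocr=True, do_table_structure=False)

# Variant B: keep tables, but use the faster (less accurate) TableFormer mode
# opts = PdfPipelineOptions(do_ocr=True, do_table_structure=True)
# opts.table_structure_options.mode = TableFormerMode.FAST

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
)
result = converter.convert("paper.pdf")  # re-time this call and compare
```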

Nevertheless, I think this is a great starting point, and we are looking forward to becoming competitive with GROBID!

cau-git commented 2 months ago

@sdspieg Great to see your investigation with docling and GROBID. Let me answer a few of your points first:

  1. Is My Setup Fair and Correct?

I was checking the provided Jupyter Notebook and found a few things that impact the comparison.

I uploaded a copy of the notebook code updated with these fixes: https://ibm.box.com/s/blk8ttyjsm9505t2ye2g3y1cbjcnzzlf

Additionally, you can set the env var OMP_NUM_THREADS to the number of cores your CPU has. Otherwise docling will default to only 4 threads.

  2. Can I Optimize GPU Usage for Docling?

Docling's GPU support is currently experimental; we are working on improving utilization. If you want to explicitly disable the GPU, you can also set the env var USE_CPU_ONLY to true.
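For example (a sketch: both variables are read from the environment, so set them before Docling loads its models):

```python
import os

# Use all CPU cores instead of the default 4 threads
os.environ["OMP_NUM_THREADS"] = str(os.cpu_count())

# Optional: force CPU-only mode, bypassing the experimental GPU path
os.environ["USE_CPU_ONLY"] = "true"

from docling.document_converter import DocumentConverter  # import after setting the vars
```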

  3. Does Docling Focus on Different Use Cases? Should Docling be expected to perform well on scholarly documents?

Docling is built to perform well on virtually any type of document, though of course it performs better on some than on others. Scholarly documents are definitely among those it should handle well. A good picture of the document types it was trained on can be seen in our DocLayNet dataset.

Meanwhile, I am working on reproducing a benchmark measurement on my end with the provided PDFs. Currently, I am blocked by the fact that GROBID apparently provides only linux x86 Docker images (none for linux/arm64), so I cannot run it natively on an Apple Silicon Mac.