Open sdspieg opened 2 months ago
The following visualizations provide insights into the performance differences between GROBID and Docling for processing scholarly PDFs. The comparison is based on the processing time for ten randomly selected PDFs, each larger than 30 KB.
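A minimal sketch of how such a sample could be drawn and timed, using only the Python standard library. The helper names `pick_pdfs` and `time_converter`, the fixed seed, and reading "30 KB" as 30 * 1024 bytes are assumptions made for illustration; they are not taken from the notebook.

```python
import random
import time
from pathlib import Path
from typing import Callable, Dict, List

# Assumption: "larger than 30 KB" is read as more than 30 * 1024 bytes.
MIN_SIZE_BYTES = 30 * 1024


def pick_pdfs(folder: Path, k: int = 10, seed: int = 0) -> List[Path]:
    """Randomly pick up to k PDFs larger than MIN_SIZE_BYTES from folder."""
    candidates = sorted(
        p for p in folder.glob("*.pdf") if p.stat().st_size > MIN_SIZE_BYTES
    )
    # A fixed seed means both tools are given exactly the same files.
    rng = random.Random(seed)
    return rng.sample(candidates, min(k, len(candidates)))


def time_converter(
    convert: Callable[[Path], object], pdfs: List[Path]
) -> Dict[str, float]:
    """Return wall-clock seconds per file for one converter callable."""
    timings: Dict[str, float] = {}
    for pdf in pdfs:
        start = time.perf_counter()
        convert(pdf)
        timings[pdf.name] = time.perf_counter() - start
    return timings
```

Passing each tool's conversion function as `convert` over the same list from `pick_pdfs` keeps the two measurements comparable. Any one-time setup, such as model loading, is best done before the timed loop so it is not charged to the first document.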
This bar chart shows the total processing time (in seconds) for all ten PDFs combined. GROBID's total is far lower than Docling's, which suggests that GROBID is better optimized for extracting information from scholarly PDFs and uses the available resources more efficiently.
This bar chart compares the processing times (in seconds) for each individual PDF file on a linear scale. GROBID is consistently faster across all documents. However, because the gap is so large, especially for some of the larger files, the Docling bars dominate the chart and GROBID's bars are barely visible in some cases.
To provide a clearer view of the differences in performance, especially where the gap is substantial, we use a logarithmic scale in this chart. The logarithmic scale compresses the large differences and makes the variations in processing time more visible across all documents. Even on this scale, GROBID demonstrates consistently lower processing times, while Docling’s processing times are significantly higher, particularly for more complex or longer documents.
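As a small worked example of why the logarithmic scale helps: on a log10 axis, a bar's position depends on the logarithm of the time, so a 100x gap becomes a difference of 2 units instead of a factor of 100 in bar height. The timings below are hypothetical placeholders chosen for round numbers, not measurements from this comparison.

```python
import math

# Hypothetical timings in seconds, for illustration only (not measured results).
timings = {"fast_tool": 2.0, "slow_tool": 200.0}

# Position of each value on a log10 axis.
log_positions = {name: math.log10(seconds) for name, seconds in timings.items()}

# Linear scale: the slow bar is `ratio` times taller than the fast one.
ratio = timings["slow_tool"] / timings["fast_tool"]

# Log scale: the same gap is a difference of `decades` axis units.
decades = log_positions["slow_tool"] - log_positions["fast_tool"]

print(f"ratio on linear scale: {ratio:.0f}x")
print(f"gap on log10 scale: {decades:.1f} decades")
```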
Overall, GROBID proves to be much faster and more efficient in processing scholarly PDFs compared to Docling. The total processing time chart illustrates a stark difference, while the document-by-document comparisons confirm GROBID's superior performance across a variety of document sizes and complexities. The logarithmic chart further demonstrates the dramatic performance gap, making it easier to visualize how much longer Docling takes for certain documents. This analysis suggests that while GROBID is well-optimized for scholarly documents, Docling may require further adjustments or optimizations to improve its performance in similar use cases.
And BTW - here's my Jupyter Notebook
@sdspieg This is a great start, thank you! First things first, I think we need to look at two things,
Nevertheless, I think this is a great starting point and we are looking forward to becoming competitive with GROBID!
@sdspieg Great to see your investigation with docling and GROBID. Let me answer a few of your points first:
- Is My Setup Fair and Correct?
I was checking the provided Jupyter Notebook and found a few things that impact the comparison. These are:
I uploaded a copy of the notebook code updated with these fixes: https://ibm.box.com/s/blk8ttyjsm9505t2ye2g3y1cbjcnzzlf
Additionally, you can set the env var OMP_NUM_THREADS to the number of cores your CPU has. Otherwise docling will default to only 4 threads.
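In a notebook, one way to do this is to set the variable from Python before docling is imported. This is a minimal sketch: it assumes the variable is read when the underlying libraries are loaded, so the cell has to run first. Note that `os.cpu_count()` reports logical cores, which may exceed the number of physical cores.

```python
import os

# Run this before importing docling: thread settings are typically read
# once, when the OpenMP-backed libraries are first loaded (assumption).
n_cores = os.cpu_count() or 1  # os.cpu_count() can return None
os.environ["OMP_NUM_THREADS"] = str(n_cores)

print(f"OMP_NUM_THREADS={os.environ['OMP_NUM_THREADS']}")
```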
- Can I Optimize GPU Usage for Docling?
Docling's GPU support is currently experimental; we are working on improving utilization. If you want to explicitly disable GPU, you can also set the env var USE_CPU_ONLY to true.
- Does Docling Focus on Different Use Cases? Should Docling be expected to perform well on scholarly documents?
Docling is built to perform well on virtually any type of document, but of course it performs better on some than on others. Scholarly documents are definitely among those it should perform well on. Our DocLayNet dataset gives a good picture of the distribution of documents it was trained on.
Meanwhile, I am working on reproducing the benchmark on my end with the provided PDFs. Currently, I am blocked by the fact that GROBID apparently provides only linux x86 docker images (none for linux/arm64), so I cannot run it natively on an Apple Silicon Mac.
Issue: Comparing GROBID and Docling for Parsing Scholarly Publications
My Use Case
We need to parse and extract all relevant information from thousands of scholarly publications, including metadata, full text, titles, references, tables, and more. For this purpose, we've been using GROBID, an open-source library for extracting structured information from scientific and scholarly documents, especially PDFs. GROBID has been utilized in the Semantic Scholar Open Research Corpus (S2ORC), a massive dataset of scholarly articles maintained by the Allen Institute for AI (AI2). GROBID has proven to be robust in handling different document layouts and complexities - for us as well!
Recently, I came across Docling, and so I decided to evaluate Docling's performance and output quality to understand its potential advantages or limitations compared to GROBID.
Experiment Setup
To compare GROBID and Docling fairly, I conducted an experiment using the same 10 randomly selected scholarly PDFs, each larger than 30 KB. Here is a summary of the setup:
Environment Setup:
Data Selection:
Conversion Process:
GROBID's /api/processFulltextDocument endpoint was used to extract metadata, full text, references, and other elements. Results were saved in XML format, and performance metrics, such as processing time per document, were recorded.
Output and Comparison:
Results were written to two separate folders: one for GROBID's results (converted_output_grobid) and one for Docling's results (converted_output_docling).
Questions and Observations
My early experiments show that GROBID is significantly faster than Docling. I suspect Docling uses the GPU less effectively, which may explain the (MUCH) longer processing times.
Given this, I have the following questions:
Is My Setup Fair and Correct?
Can I Optimize GPU Usage for Docling?
Does Docling Focus on Different Use Cases?
Additional Context
docker run --name GROBID --rm --gpus all --init --ulimit core=0 -p 9070:8070 -p 9081:8071 grobid/grobid:0.8.0
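With the port mapping in the command above (host port 9070 to GROBID's 8070), a call to the /api/processFulltextDocument endpoint could look roughly like the sketch below. It uses only the Python standard library; the helper names `build_multipart` and `grobid_fulltext` and the 300-second timeout are illustrative assumptions, and the notebook may well use a different HTTP client. The PDF is sent as the multipart form field `input`, as in GROBID's documented curl examples.

```python
import urllib.request
import uuid
from pathlib import Path
from typing import Tuple

# Host port 9070 maps to GROBID's 8070 in the docker command above.
GROBID_URL = "http://localhost:9070/api/processFulltextDocument"


def build_multipart(pdf_name: str, pdf_bytes: bytes) -> Tuple[bytes, str]:
    """Encode one PDF as multipart/form-data under the form field 'input'."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="input"; filename="{pdf_name}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + pdf_bytes + tail, f"multipart/form-data; boundary={boundary}"


def grobid_fulltext(
    pdf_path: Path, url: str = GROBID_URL, timeout: float = 300.0
) -> str:
    """POST a PDF to a running GROBID server and return the XML response."""
    body, content_type = build_multipart(pdf_path.name, pdf_path.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `grobid_fulltext(Path("paper.pdf"))` requires the container to be running; `paper.pdf` is a placeholder file name.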
Thank you for your insights!