Alternatives to GROBID (PDF parsing)

BlueBrain / Search

Blue Brain text mining toolbox for semantic search and structured information extraction

https://blue-brain-search.readthedocs.io

GNU Lesser General Public License v3.0


Alternatives to GROBID (PDF parsing) #476

Open jankrepl opened 2 years ago

jankrepl commented 2 years ago

Are there any alternatives to GROBID and would there be any major advantages in using them?

Alternatives (feel free to add new entries)

Comments

If we go for a pure Python solution, there might be no need for an intermediary format (i.e. TEI XML for GROBID).

FrancescoCasalegno commented 2 years ago

I think that having a benchmark of various possible solutions is a good idea.

I also agree that using GROBID creates some complications:

the output is an intermediary format
you need to docker pull a GROBID image to run the server – but how do we track the version of the GROBID server running?
instead of directly calling a function, we need to send requests to a server, which may be seen as an unnecessary complication
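
On the version-tracking question in the list above, one option is to ask the running server directly. The following is a minimal sketch, assuming the server exposes GROBID's `GET /api/version` endpoint on the default port 8070. The function name `grobid_server_version` and the injectable `opener` parameter are hypothetical and only there for illustration. The response format may differ between GROBID versions, so the body is returned as raw text.

```python
import urllib.request


def grobid_server_version(base_url="http://localhost:8070",
                          opener=urllib.request.urlopen):
    """Return the raw body of GET <base_url>/api/version as text.

    `base_url` and `opener` are illustrative assumptions; `opener` can be
    swapped for a stub so the function is testable without a live server.
    """
    url = base_url.rstrip("/") + "/api/version"
    with opener(url, timeout=10) as response:
        # The body is not parsed here because its exact format
        # (plain text or JSON) is not assumed.
        return response.read().decode("utf-8").strip()
```

Logging this value next to each parsed article would at least record which server version produced it.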

But maybe for the moment we can wait until we see some failure cases of GROBID on our articles before thinking about alternatives. In the end, GROBID seems to be a well-established solution, used e.g. by the creators of CORD-19. What do you think, @jankrepl?

FrancescoCasalegno commented 2 years ago

Also, I had a look at the paper they used in that blog post for their benchmark: https://schoolshooters.info/sites/default/files/2014-NaBITA-Whitepaper-Text-with-Graphics.pdf

I think it looks a bit simple (was it written in Google Docs/Word and then saved as PDF?) compared to other two-column articles with lots of figures and tables generated with LaTeX like the ones we have to deal with.

So when we run this benchmark, I think we should also test on different kinds of papers.

EmilieDel commented 2 years ago

Small side note related to this: GROBID saves the version used to convert the PDF to TEI XML in the XML file (see here).
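
To make the note above concrete, here is a minimal sketch of reading that version back from a TEI file. It assumes the version is stored as the `version` attribute of an `application` element with `ident="GROBID"` inside `appInfo` in the TEI header. The function name `grobid_version_from_tei` is hypothetical, and `SAMPLE_TEI` is a trimmed, made-up document with a placeholder version number, not real GROBID output.

```python
import xml.etree.ElementTree as ET

# TEI documents use this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def grobid_version_from_tei(tei_xml):
    """Return the GROBID version recorded in a TEI string, or None."""
    root = ET.fromstring(tei_xml)
    for app in root.iterfind(".//tei:appInfo/tei:application", TEI_NS):
        if app.get("ident") == "GROBID":
            return app.get("version")
    return None


# Made-up sample for illustration only; the version is a placeholder.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <encodingDesc>
      <appInfo>
        <application version="0.7.0" ident="GROBID" when="2021-10-01T00:00+0000"/>
      </appInfo>
    </encodingDesc>
  </teiHeader>
</TEI>"""

print(grobid_version_from_tei(SAMPLE_TEI))
```

If the element is missing, the function returns `None`, which a caller could treat as "version unknown".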

pafonta commented 2 years ago

As an alternative to GROBID, there is the solution here, developed in the context of OpenMinTeD.

The extracted text could be accessed through document_text here.
