Open jankrepl opened 2 years ago
I think that having a benchmark of various possible solutions is a good idea.
I also agree that using GROBID creates some complications: we need to `docker pull` a GROBID image to run the server, but how do we track the version of the GROBID server that is running?
But maybe for the moment we can wait to see some failure cases of GROBID on our articles before thinking about alternatives. In the end, GROBID seems to be a well-established solution, used e.g. by the creators of CORD-19.
What do you think, @jankrepl?
Also, I had a look at the paper they used in that blog post for their benchmark: https://schoolshooters.info/sites/default/files/2014-NaBITA-Whitepaper-Text-with-Graphics.pdf
I think it looks a bit simple (was it written in Google Docs/Word and then saved as PDF?) compared to other two-column articles with lots of figures and tables generated with LaTeX like the ones we have to deal with.
So when we run this benchmark, I think we should also test on different kinds of papers.
Small side note related to this: GROBID saves the version used to convert the PDF to TEI XML in the XML file (see here).
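To make the version tracking concrete, here is a minimal sketch of how that version could be read back from a converted file. It assumes the TEI header contains an `application` element inside `appInfo` with `ident="GROBID"` and a `version` attribute; that layout and the helper name `grobid_version_from_tei` are assumptions for illustration and should be checked against our actual output.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# TEI documents use this default namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def grobid_version_from_tei(tei_xml: str) -> Optional[str]:
    """Return the GROBID version recorded in a TEI XML string, if any.

    Assumes (not verified against every GROBID release) that the header
    holds <appInfo><application ident="GROBID" version="..."/></appInfo>.
    """
    root = ET.fromstring(tei_xml)
    for app in root.iterfind(".//tei:appInfo/tei:application", TEI_NS):
        if app.get("ident", "").upper() == "GROBID":
            return app.get("version")
    return None


# Hypothetical, heavily trimmed TEI document used only for illustration.
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><encodingDesc>'
    '<appInfo><application version="0.6.0" ident="GROBID"/></appInfo>'
    "</encodingDesc></teiHeader></TEI>"
)
print(grobid_version_from_tei(sample))  # prints 0.6.0
```

Storing this value next to each parsed article would let us tell afterwards which server version produced which output, even if the Docker image changes.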
As an alternative to GROBID, there is the solution here, developed in the context of OpenMinTeD. The extracted text could be accessed through `document_text` here.
Are there any alternatives to GROBID and would there be any major advantages in using them?
Alternatives (feel free to add new entries)
Other links
Comments
If we go for a pure Python solution, there might be no need for intermediary formats (i.e. TEI XML for GROBID).