cern-sis / issues-inspire


Improve reference extraction #505

Open · miguelgrc opened this issue 3 months ago

miguelgrc commented 3 months ago

We currently use https://github.com/inspirehep/refextract for reference extraction, which, according to this paper, obtains a 0.49 F1 score, while models like CERMINE and GROBID obtain 0.74 and 0.79 respectively. This makes it worth considering replacing refextract, which we have to maintain ourselves, with one of these two models.

However, we need to analyze the dataset used in that paper more closely. Another article evaluates those two tools (together with others, but not refextract) across different fields, and they don't perform very well in Physics and Astronomy. That article's dataset is very limited, though, with only two papers per field, so we can't draw firm conclusions from it. It also includes other tools in the comparison, but those don't seem to be able to parse a PDF: they need the reference string as input, which is not what we want. We could add a pdftotext step beforehand, but I don't know how much that would improve the results, since refextract already uses pdftotext, and so does Science Parse (mentioned in the first paper), which scores the same as refextract; this suggests pdftotext itself may be a limiting factor. Therefore, the most sensible next step seems to be to evaluate CERMINE and GROBID against refextract.
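As a concrete baseline for that comparison, a minimal sketch of the current refextract call (assuming the `extract_references_from_file` helper from its README and a placeholder PDF path) could look like this:

```python
# Minimal sketch of the refextract baseline we would compare against.
# Assumes refextract is installed (pip install refextract); the PDF path is a placeholder.
from refextract import extract_references_from_file

references = extract_references_from_file("sample_paper.pdf")

# Each extracted reference is a dict whose keys (e.g. 'author', 'doi', 'year',
# 'journal_title') depend on what could actually be parsed from the raw string.
for ref in references:
    print(ref.get("doi"), ref.get("author"), ref.get("year"))
```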

One small obstacle for running either of those tools is that they don't offer dockerized versions that run on ARM, so it's not that easy to run them locally: we will have to try setting up a Java runtime and running them from source instead. However, CERMINE offers a website that we can use for some initial testing: http://cermine.ceon.pl (example)
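For GROBID, once a local instance is running on a Java runtime as described above, its REST service can be exercised directly; a rough sketch (assuming the default port 8070 and the standard processReferences endpoint, with a placeholder PDF path) could be:

```python
# Sketch: send a PDF to a locally running GROBID service and get its references back.
# Assumes GROBID is listening on http://localhost:8070 (its default port);
# the PDF path is a placeholder.
import requests

GROBID_URL = "http://localhost:8070/api/processReferences"

with open("sample_paper.pdf", "rb") as pdf:
    response = requests.post(
        GROBID_URL,
        files={"input": pdf},
        data={"consolidateCitations": "0"},  # no external consolidation, raw extraction only
        timeout=120,
    )

response.raise_for_status()
# GROBID answers with TEI XML; the <biblStruct> entries would still need to be
# parsed (e.g. with lxml) before comparing fields against refextract/CERMINE.
print(response.text[:1000])
```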

miguelgrc commented 3 months ago

After some manual testing:

Sample document I got from INSPIRE with the respective extracted references, where you can observe all these scenarios:

@michamos @drjova let me know your opinion based on what you can see in this example, and which fields are more or less important for us (I assume DOI, authors, journal, and year are important, while volume or pages are not as crucial), so that, if we want to continue with this, we can also think about automated tests on a bigger dataset to extract more representative statistics (and to evaluate the performance with parallel requests), and how to do that. First thoughts:

  1. We can compare the author lists between them. This would require some pre-processing: for example, refextract returns authors as F. Deng, Y. Lü, while CERMINE returns Deng, F., Lü, Y. and GROBID returns Deng, F and Lü, Y. But how do we score them? Do we simply give more points to the one with more authors? What if, as happened once with CERMINE, part of the title is read as an author? (See the normalization sketch after this list.)
  2. We can check whether or not they have extracted the DOI, and maybe even query arXiv or INSPIRE to verify that the DOI is correct. In fact, if this works well, we could use it to fetch the actual author list from e.g. arXiv (if the paper is there) and verify the previous point.
  3. We can count the number of references each of them extracts.
  4. We can compare journal titles, years...
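To illustrate the pre-processing needed for point 1, here is a rough sketch of normalizing the three author formats seen above to a common "surname, first initial" key and scoring the overlap between two author lists; the normalization rule and the F1-style score are assumptions for the sketch, not a settled choice:

```python
# Sketch: normalize author strings from the three tools to a comparable form so
# that e.g. "F. Deng", "Deng, F." and "Deng, F" all map to the same key.
# The "surname, first initial" rule and the F1-style score are only illustrative.
import unicodedata


def normalize_author(raw: str) -> str:
    raw = raw.strip().rstrip(".")
    if "," in raw:                      # "Deng, F." / "Deng, F"
        surname, initials = (p.strip() for p in raw.split(",", 1))
    else:                               # "F. Deng"
        parts = raw.split()
        surname, initials = parts[-1], " ".join(parts[:-1])
    initial = initials[:1] if initials else ""
    key = f"{surname}, {initial}".lower()
    # strip accents so that "Lü, Y." and "Lu, Y." compare equal
    return "".join(c for c in unicodedata.normalize("NFKD", key)
                   if not unicodedata.combining(c))


def author_overlap(extracted: list[str], reference: list[str]) -> float:
    """F1-style overlap between an extracted author list and a reference list."""
    a = {normalize_author(x) for x in extracted}
    b = {normalize_author(x) for x in reference}
    if not a or not b:
        return 0.0
    precision = len(a & b) / len(a)
    recall = len(a & b) / len(b)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)


# refextract-style author strings vs. an (assumed) arXiv/INSPIRE author list
print(author_overlap(["F. Deng", "Y. Lü"], ["Deng, Fei", "Lu, Yan"]))  # -> 1.0
```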
michamos commented 3 months ago

I'm happy to discuss this further when I'm back from holidays. Here are a few pointers already:

Something else that might be worth looking at, and that we discussed during the INSPIRE week, is using Crossref to do the heavy lifting for reference parsing + linking. They have a service where you can give it a string containing a reference list and it will return a list of DOIs that matched.
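For a quick first test of that idea, individual reference strings can already be matched through the public Crossref REST API; a minimal sketch (using the /works endpoint with query.bibliographic, which may not be the exact bulk service mentioned above, and a made-up reference string) could be:

```python
# Sketch: match a single reference string to a DOI via the public Crossref REST API.
# Uses the /works endpoint with query.bibliographic; the reference string and the
# contact address are placeholders.
import requests


def match_reference(ref_string):
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": ref_string, "rows": 1},
        headers={"User-Agent": "refextract-evaluation (mailto:placeholder@example.org)"},
        timeout=30,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return items[0]["DOI"] if items else None


print(match_reference("F. Deng and Y. Lü, Placeholder title, J. Placeholder 12 (2020) 345"))
```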