cern-sis / issues-inspire


Improve reference extraction #505

Open · miguelgrc opened this issue 3 months ago

miguelgrc commented 3 months ago

We currently use https://github.com/inspirehep/refextract for reference extraction, which, according to this paper, obtains a 0.49 F1 score, while models like CERMINE and GROBID obtain 0.74 and 0.79 respectively. This makes it worth considering replacing refextract, which we have to maintain ourselves, with one of these two models.

However, we need to analyze the dataset used in that paper more closely. Another article evaluates those two tools (together with others, but not refextract) across different fields, and they don't perform very well in Physics and Astronomy. That article's dataset is very limited, though, with only two papers per field, so we can't draw firm conclusions from it. It also includes other tools in the comparison, but those don't seem to be able to parse a PDF: they need the reference string as input, which is not what we want. We could add a pdftotext step beforehand, but I don't know how much that would improve the results, since refextract already uses pdftotext, and so does Science Parse (mentioned in the first paper), which scores the same as refextract; this suggests pdftotext itself may be a limiting factor. Therefore, the most sensible next step seems to be to evaluate CERMINE and GROBID against refextract.
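As a concrete baseline for that comparison, a minimal sketch of the current refextract call (assuming the `extract_references_from_file` helper from its README and a placeholder PDF path) could look like this:

```python
# Minimal sketch of the refextract baseline we would compare against.
# Assumes refextract is installed (pip install refextract); the PDF path is a placeholder.
from refextract import extract_references_from_file

references = extract_references_from_file("sample_paper.pdf")

# Each extracted reference is a dict whose keys (e.g. 'author', 'doi', 'year',
# 'journal_title') depend on what could actually be parsed from the raw string.
for ref in references:
    print(ref.get("doi"), ref.get("author"), ref.get("year"))
```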

One small obstacle for running either of those tools is that they don't offer dockerized versions that run on ARM, so it's not that easy to run them locally: we will have to try setting up a Java runtime and running them from source instead. However, CERMINE offers a website that we can use for some initial testing: http://cermine.ceon.pl (example)
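For GROBID, once a local instance is running on a Java runtime as described above, its REST service can be exercised directly; a rough sketch (assuming the default port 8070 and the standard processReferences endpoint, with a placeholder PDF path) could be:

```python
# Sketch: send a PDF to a locally running GROBID service and get its references back.
# Assumes GROBID is listening on http://localhost:8070 (its default port);
# the PDF path is a placeholder.
import requests

GROBID_URL = "http://localhost:8070/api/processReferences"

with open("sample_paper.pdf", "rb") as pdf:
    response = requests.post(
        GROBID_URL,
        files={"input": pdf},
        data={"consolidateCitations": "0"},  # no external consolidation, raw extraction only
        timeout=120,
    )

response.raise_for_status()
# GROBID answers with TEI XML; the <biblStruct> entries would still need to be
# parsed (e.g. with lxml) before comparing fields against refextract/CERMINE.
print(response.text[:1000])
```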

miguelgrc commented 3 months ago

After some manual testing:

Sample document I got from INSPIRE with the respective extracted references, where you can observe all these scenarios:

@michamos @drjova let me know your opinion based on what you can see in this example, and which fields are more or less important for us (I assume DOI, authors, journal, and year are important, while volume or pages are not as crucial), so that, if we want to continue with this, we can also think about automated tests on a bigger dataset to extract more representative statistics (and to evaluate the performance with parallel requests), and how to do that. First thoughts:

  1. We can compare the author lists between them. This would require some pre-processing: for example, refextract returns authors as F. Deng, Y. Lü, while CERMINE returns Deng, F., Lü, Y. and GROBID returns Deng, F and Lü, Y. But how do we score them? Do we simply give more points to the one with more authors? What if, as happened once with CERMINE, part of the title is read as an author? (See the normalization sketch after this list.)
  2. We can check whether or not they have extracted the DOI, and maybe even query arXiv or INSPIRE to verify that the DOI is correct. In fact, if this works well, we could use it to fetch the actual author list from e.g. arXiv (if the paper is there) and verify the previous point.
  3. We can count the number of references each of them extracts.
  4. We can compare journal titles, years...
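To illustrate the pre-processing needed for point 1, here is a rough sketch of normalizing the three author formats seen above to a common "surname, first initial" key and scoring the overlap between two author lists; the normalization rule and the F1-style score are assumptions for the sketch, not a settled choice:

```python
# Sketch: normalize author strings from the three tools to a comparable form so
# that e.g. "F. Deng", "Deng, F." and "Deng, F" all map to the same key.
# The "surname, first initial" rule and the F1-style score are only illustrative.
import unicodedata


def normalize_author(raw: str) -> str:
    raw = raw.strip().rstrip(".")
    if "," in raw:                      # "Deng, F." / "Deng, F"
        surname, initials = (p.strip() for p in raw.split(",", 1))
    else:                               # "F. Deng"
        parts = raw.split()
        surname, initials = parts[-1], " ".join(parts[:-1])
    initial = initials[:1] if initials else ""
    key = f"{surname}, {initial}".lower()
    # strip accents so that "Lü, Y." and "Lu, Y." compare equal
    return "".join(c for c in unicodedata.normalize("NFKD", key)
                   if not unicodedata.combining(c))


def author_overlap(extracted: list[str], reference: list[str]) -> float:
    """F1-style overlap between an extracted author list and a reference list."""
    a = {normalize_author(x) for x in extracted}
    b = {normalize_author(x) for x in reference}
    if not a or not b:
        return 0.0
    precision = len(a & b) / len(a)
    recall = len(a & b) / len(b)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)


# refextract-style author strings vs. an (assumed) arXiv/INSPIRE author list
print(author_overlap(["F. Deng", "Y. Lü"], ["Deng, Fei", "Lu, Yan"]))  # -> 1.0
```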
michamos commented 3 months ago

I'm happy to discuss this further when I'm back from holidays. Here are a few pointers already:

Something else that might be worth looking at, and that we discussed during the INSPIRE week, is using Crossref to do the heavy lifting for reference parsing + linking. They have a service where you can give it a string containing a reference list and it will return a list of DOIs that matched.
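For a quick first test of that idea, individual reference strings can already be matched through the public Crossref REST API; a minimal sketch (using the /works endpoint with query.bibliographic, which may not be the exact bulk service mentioned above, and a made-up reference string) could be:

```python
# Sketch: match a single reference string to a DOI via the public Crossref REST API.
# Uses the /works endpoint with query.bibliographic; the reference string and the
# contact address are placeholders.
import requests


def match_reference(ref_string):
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": ref_string, "rows": 1},
        headers={"User-Agent": "refextract-evaluation (mailto:placeholder@example.org)"},
        timeout=30,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return items[0]["DOI"] if items else None


print(match_reference("F. Deng and Y. Lü, Placeholder title, J. Placeholder 12 (2020) 345"))
```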