miguelgrc opened 3 months ago
After some manual testing:
It sometimes misses the volume or pages, and doesn't parse the journal accurately.
Author parsing also breaks on multi-part surnames. For example,
Benedikt, M., Blondel, A., Brunner, O., Capeans Garrido, M., Cerutti, F., Gutleber, J., Janot, P., Jimenez, J.M., Mertens, V., Milanese, A., Oide, K., Osborne, J.A., Otto, T., Papaphilippou, Y., Poole, J., Tavian, L.J., Zimmermann, F.
-> Benedikt, M and Blondel, A and Brunner, O and Capeans, M and Garrido, F and Cerutti, J and Gutleber, P and Janot, J and Jimenez, V and Mertens, A and Milanese, K and Oide, J and Osborne, T and Otto, Y and Papaphilippou, J and Poole, L and Tavian, F and Zimmermann
i.e. it got confused with Capeans Garrido, split the surname in two, and shifted every following initial by one author.
Sample document I got from INSPIRE with the respective extracted references, where you can observe all these scenarios:
@michamos @drjova let me know your opinion on what you can see in this example and which fields are more or less important for us (I assume DOI, authors, journal, and year are important, while volume and pages are not that crucial). If we want to continue with this, we can then think about automated tests on a bigger dataset to extract more representative statistics (and to evaluate the performance with parallel requests), and how to do that. First thoughts:
For one reference we get F. Deng, Y. Lü, while Cermine returns Deng, F., Lü, Y. and Grobid returns Deng, F and Lü, Y. But how do we score them? Do we simply give more points to the one with more authors? What if, as happened once with Cermine, part of the title is read as an author?
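One option, as a rough sketch (the surname normalisation and the 0.9 threshold below are arbitrary assumptions, not something we've agreed on): fuzzily match the extracted authors against the known author list and report precision/recall instead of raw counts, so that a spurious "author" coming from the title hurts precision rather than earning extra points.

```python
from difflib import SequenceMatcher

def surname(author):
    """'Capeans Garrido, M.' -> 'capeans garrido' (very naive normalisation)."""
    return author.split(",")[0].strip().lower()

def author_scores(extracted, ground_truth, threshold=0.9):
    """Greedily match extracted authors to true authors by fuzzy surname similarity.

    Returns (precision, recall, f1): a title fragment misread as an author
    lowers precision, a missed author lowers recall.
    """
    remaining = [surname(a) for a in ground_truth]
    matched = 0
    for name in (surname(a) for a in extracted):
        best_index, best_ratio = None, 0.0
        for i, candidate in enumerate(remaining):
            ratio = SequenceMatcher(None, name, candidate).ratio()
            if ratio > best_ratio:
                best_index, best_ratio = i, ratio
        if best_index is not None and best_ratio >= threshold:
            matched += 1
            remaining.pop(best_index)  # each true author can only be matched once
    precision = matched / len(extracted) if extracted else 0.0
    recall = matched / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# The Grobid output from the example above vs. the true author list:
print(author_scores(["Deng, F", "Lü, Y"], ["Deng, F.", "Lü, Y."]))  # (1.0, 1.0, 1.0)
```

Greedy one-to-one matching is crude, but it already penalises both missing authors and invented ones, which raw author counts don't.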
I'm happy to discuss this further when I'm back from holidays. Here are a few pointers already:

Something else that might be worth looking at, and that we discussed during the INSPIRE week, is to use Crossref to do the heavy lifting for reference parsing + linking. They have a service where you can give it a string containing a reference list and it will return a list of DOIs that matched.
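As a rough idea of the effort involved, here is a minimal sketch against Crossref's public REST API, matching one reference string at a time via the query.bibliographic parameter (whether this is the same service as the batch one above, and what relevance score counts as a confident match, would still need to be checked):

```python
import requests

def match_reference(ref_string, mailto="someone@example.org"):
    """Ask Crossref for the best match for a free-text reference string.

    Uses the public REST API's query.bibliographic parameter; `mailto` is a
    placeholder for whatever contact address we want to use for the polite pool.
    Returns (doi, score) or (None, None) if nothing comes back.
    """
    response = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": ref_string, "rows": 1, "mailto": mailto},
        timeout=30,
    )
    response.raise_for_status()
    items = response.json()["message"]["items"]
    if not items:
        return None, None
    return items[0]["DOI"], items[0]["score"]

# doi, score = match_reference(reference_string)  # accept only above some score threshold
```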
We currently use https://github.com/inspirehep/refextract for reference extraction, which, according to this paper, obtains an F1 score of 0.49, while models like CERMINE or GROBID obtain 0.74 and 0.79 respectively. This makes it worth considering replacing refextract, which we have to maintain ourselves, with one of these two models.
However, we need to analyze the dataset used in that paper more carefully: another article analyzes those two tools (together with others, but not including refextract) across different fields, and they don't perform very well in Physics and Astronomy. That dataset is very limited, though, with only two papers per field, so we can't draw clear conclusions from it. Also, this paper includes other tools in the comparison, but they don't seem capable of parsing a PDF; they need the reference string as input, which is not what we want. We could add a pdftotext step in front of them, but I don't know how much that would improve the results, since refextract already uses pdftotext, as does Science Parse (mentioned in the first paper), which scores the same as refextract; this could make us suspect that pdftotext is a limiting factor. Therefore, the most sensible option seems to be to evaluate CERMINE and GROBID against refextract:
One small obstacle for running either of those tools is that they don't offer dockerized versions that run on ARM, so it's not that easy to run them locally: we will have to try setting up a Java runtime and running them from source instead. However, CERMINE offers a website that we can use for some initial testing: http://cermine.ceon.pl (example)
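For that initial testing, something along these lines should be enough to push one of our sample PDFs through the site, assuming the /extract.do endpoint documented in the CERMINE README is what backs it:

```python
import requests

def cermine_extract(pdf_path):
    """POST a PDF to the public CERMINE demo service and return its XML output.

    Relies on the /extract.do endpoint from the CERMINE README; the response
    is NLM/JATS-style XML that includes the parsed reference list.
    """
    with open(pdf_path, "rb") as pdf:
        response = requests.post(
            "http://cermine.ceon.pl/extract.do",
            data=pdf,
            headers={"Content-Type": "application/binary"},
            timeout=300,
        )
    response.raise_for_status()
    return response.text

# xml = cermine_extract("sample_from_inspire.pdf")  # hypothetical local file name
```

If the site goes away or rate-limits us, the fallback is running CERMINE from source with a local Java runtime, as mentioned above.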