adsabs / ADSIngestParser

Curation parser library
MIT License

Name and collaboration post-processing #95

Open seasidesparrow opened 7 months ago

seasidesparrow commented 7 months ago

This issue concerns author and collaboration names that the publisher has fielded poorly and that need to be disentangled in an additional parsing and normalization step. For example, the metadata in /proj/ads/abstracts/data/IOPP/2023/2023-03-15/0004-637X/0004-637X_945/0004-637X_945_2/0004-637X_945_2_124/apj_945_2_124.xml has the following for the first author:

<contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Abdurashidova</surname><given-names>The HERA Collaboration: Zara</given-names></name><xref ref-type="aff" rid="affiliation01">1</xref></contrib>

As part of the normalization process, we need a utility that takes the parsed data from the publisher record and looks for problematic data like this. In this case, the JATS parser will field the first author as Abdurashidova, The HERA Collaboration: Zara, which needs to be reparsed as "The HERA Collaboration"; Abdurashidova, Zara.
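A rough sketch of what such a utility might look like, for discussion only. The function names (`split_collaboration`, `reparse_authors`), the `(surname, given_names)` input shape, and the keyword list in the regex are all assumptions made for illustration; none of them come from ADSIngestParser or adsabs-pyingest. It only handles the case where a collaboration label followed by a colon is prepended to the given names.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical pattern: given-names text that starts with a collaboration
# label followed by a colon, e.g. "The HERA Collaboration: Zara".
# The keyword list is an assumption and would need tuning on real data.
COLLAB_PREFIX = re.compile(
    r"^\s*(?P<collab>[^:]*\b(?:Collaboration|Consortium|Team)\b[^:]*):"
    r"\s*(?P<given>.*)$",
    re.IGNORECASE,
)


def split_collaboration(surname: str, given: str) -> Tuple[Optional[str], str]:
    """Return (collaboration or None, "Surname, Given")."""
    match = COLLAB_PREFIX.match(given)
    if match is None:
        # Nothing suspicious: keep the name as the parser fielded it.
        return None, f"{surname}, {given.strip()}"
    collab = match.group("collab").strip()
    cleaned_given = match.group("given").strip()
    return collab, f"{surname}, {cleaned_given}"


def reparse_authors(authors: List[Tuple[str, str]]) -> List[str]:
    """Rebuild the author list, emitting each collaboration once,
    immediately before the author it was attached to."""
    result: List[str] = []
    for surname, given in authors:
        collab, name = split_collaboration(surname, given)
        if collab is not None and collab not in result:
            result.append(collab)
        result.append(name)
    return result
```

With the record above, `reparse_authors([("Abdurashidova", "The HERA Collaboration: Zara")])` would yield the collaboration and the author as two separate entries. Cases where the collaboration ends up in the surname field, or is separated by something other than a colon, are not covered here.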

One approach would be to adapt the author_names.py program from the old adsabs-pyingest repository: https://github.com/adsabs/adsabs-pyingest/blob/master/pyingest/parsers/author_names.py

seasidesparrow commented 7 months ago

This issue replaces https://github.com/adsabs/ADSManualParser/issues/12