This issue relates to author and collaboration names that are poorly fielded by the publisher and need to be disentangled in an additional parsing and normalization step. As an example, the metadata in /proj/ads/abstracts/data/IOPP/2023/2023-03-15/0004-637X/0004-637X_945/0004-637X_945_2/0004-637X_945_2_124/apj_945_2_124.xml has the following for the first author:
<contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Abdurashidova</surname><given-names>The HERA Collaboration: Zara</given-names></name><xref ref-type="aff" rid="affiliation01">1</xref></contrib>
As part of the normalization process, we need a utility that can take the parsed data from the publisher record and detect problematic data like this. In this case, the JATS parser will field the first author as Abdurashidova, The HERA Collaboration: Zara, which needs to be reparsed as "The HERA Collaboration"; Abdurashidova, Zara.
One approach would be to adapt the author_names.py program in the old adsabs-pyingest repository: https://github.com/adsabs/adsabs-pyingest/blob/master/pyingest/parsers/author_names.py
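
To make the desired behavior concrete, here is a minimal sketch of the reparsing step for the example above. It is not taken from author_names.py; the function name `split_collaboration`, the pattern `COLLAB_PREFIX`, and the keywords other than "Collaboration" are assumptions made for illustration, and a real utility would likely need to cover more publisher quirks.

```python
import re

# Hypothetical pattern: a label ending in "Collaboration" (as in the example
# record), followed by a colon and the author's real given name.
# "Consortium" and "Team" are assumed extra keywords, not taken from the record.
COLLAB_PREFIX = re.compile(
    r"^\s*(?P<collab>.*?\b(?:Collaboration|Consortium|Team))\s*:\s*(?P<given>.+?)\s*$",
    re.IGNORECASE,
)


def split_collaboration(author):
    """Split 'Surname, Collab: Given' into (collab or None, 'Surname, Given')."""
    surname, sep, given = author.partition(",")
    if not sep:
        # No comma, so there is no given-names field to inspect.
        return None, author.strip()
    match = COLLAB_PREFIX.match(given)
    if match is None:
        # Ordinary author name: return it with whitespace tidied.
        return None, f"{surname.strip()}, {given.strip()}"
    return match.group("collab"), f"{surname.strip()}, {match.group('given')}"


collab, author = split_collaboration("Abdurashidova, The HERA Collaboration: Zara")
print(f'"{collab}"; {author}')
```

Names that do not carry a collaboration prefix are returned unchanged with `None` as the collaboration, so the function could be applied to every parsed author in a record.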