adsabs / ADSIngestParser

Curation parser library
MIT License

Capture affiliation ID data for all parsers when available #104

Open seasidesparrow opened 3 months ago

seasidesparrow commented 3 months ago

Is your feature request related to a problem? Please describe. Currently, parsers capture affiliation data in text format, and it is added to "affPubRaw" in the ingest data model Affil object. However, affiliation data may also be provided as an affiliation identifier from various systems, e.g. ROR, ISNI, or GRID, either with or in place of text data. As an example, Crossref XML includes the tag <institution_id type="TYPE"> as a possible return field (see https://www.crossref.org/documentation/schema-library/markup-guide-metadata-segments/affiliations/). The ADS Ingest_Data_Model Affil object already has space for affPubID and affPubIDType, but they are not yet implemented in base.py or in any of the parsers.

Describe the solution you'd like We should add logic to each of the content parsers to detect institution identifiers, assign them to the correct fields, and store them in the ingest_data_model.affils.affPubID and affPubIDType fields for each contributor that has them.

Additional context As an example, the input test file jats_springer_EPJC_s10052-023-11699-1.xml has <institution-id> tags for both GRID and ISNI:

[...]
                                <aff id="Aff154">
                                        <label>154</label>
                                        <institution-wrap>
                                                <institution-id institution-id-type="GRID">grid.470046.1</institution-id>
                                                <institution-id institution-id-type="ISNI">0000 0004 0452 0652</institution-id>
                                                <institution content-type="org-name">CPPM, Aix-Marseille Université, CNRS/IN2P3</institution>
                                        </institution-wrap>
                                        <addr-line content-type="city">Marseille</addr-line>                            
                                        <country country="FR">France</country>
                                </aff>
[...]

In this particular example, we see two identifiers, GRID and ISNI. Currently, the ingest_data_model expects a single value here; we might consider updating the data model to support a list of id-type objects, or merging multiple values into a single string via a join statement.
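
To make the two options concrete, here is a minimal sketch that pulls the <institution-id> values out of the <aff> element quoted above and shows both shapes: a list of id-type objects, and a single joined string per field. It uses only the standard library rather than the parser's own base.py machinery; the function names and the "; " separator are hypothetical, chosen for illustration only, and the dict keys simply reuse the affPubID and affPubIDType field names.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <aff> element from the Springer test file quoted above.
AFF_XML = """
<aff id="Aff154">
  <label>154</label>
  <institution-wrap>
    <institution-id institution-id-type="GRID">grid.470046.1</institution-id>
    <institution-id institution-id-type="ISNI">0000 0004 0452 0652</institution-id>
    <institution content-type="org-name">CPPM, Aix-Marseille Université, CNRS/IN2P3</institution>
  </institution-wrap>
  <addr-line content-type="city">Marseille</addr-line>
  <country country="FR">France</country>
</aff>
"""


def extract_institution_ids(aff_xml):
    """Option 1: return one {affPubIDType, affPubID} dict per <institution-id>.

    Hypothetical helper, not part of ADSIngestParser.
    """
    aff = ET.fromstring(aff_xml.strip())
    ids = []
    for node in aff.iter("institution-id"):
        value = (node.text or "").strip()
        if value:  # skip empty tags
            ids.append(
                {
                    "affPubIDType": node.get("institution-id-type", ""),
                    "affPubID": value,
                }
            )
    return ids


def join_institution_ids(ids, sep="; "):
    """Option 2: collapse the list into single strings (separator is an assumption)."""
    return {
        "affPubIDType": sep.join(item["affPubIDType"] for item in ids),
        "affPubID": sep.join(item["affPubID"] for item in ids),
    }


ids = extract_institution_ids(AFF_XML)
joined = join_institution_ids(ids)
print(ids)
print(joined)
```

The list form keeps each identifier paired with its type, which the joined form only preserves by position; if the join route is taken, the separator must be one that cannot occur inside an identifier (ISNI values contain spaces, for instance).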