adsabs / ADSIngestParser

Curation parser library
MIT License

DataCite parser should support detection of multiple names in one `<creatorName>` tag #112

Open seasidesparrow opened 5 months ago

seasidesparrow commented 5 months ago

Describe the bug

At least one important DataCite source, NASA IRSA, publishes records in which all authors of a data set are packed into a single <creatorName> tag, with individual names separated by semicolons. For example, the DataCite record for doi:10.26131/IRSA1 has the following <creators> structure:

<creators>
<creator>
<creatorName>Wright, Edward L.; Eisenhardt, Peter R. M.; Mainzer, Amy K.; Ressler, Michael E.; Cutri, Roc M.; Jarrett, Thomas; Kirkpatrick, J. Davy; Padgett, Deborah; McMillan, Robert S.; Skrutskie, Michael; Stanford, S. A.; Cohen, Martin; Walker, Russell G.; Mather, John C.; Leisawitz, David; Gautier, Thomas N., III; McLean, Ian; Benford, Dominic; Lonsdale, Carol J.; Blain, Andrew; Mendez, Bryan; Irace, William R.; Duval, Valerie; Liu, Fengchuan; Royer, Don; Heinrichsen, Ingolf; Howard, Joan; Shannon, Mark; Kendall, Martha; Walsh, Amy L.; Larsen, Mark; Cardon, Joel G.; Schick, Scott; Schwalm, Mark; Abid, Mohamed; Fabinsky, Beth; Naes, Larry; Tsai, ChaoWei</creatorName>
</creator>
</creators>

Parsing this record with the current DataCite parser (v0.9.20) and passing the result to Manual Parser's classic tagger produces a single, garbled author instead of the full author list: %A Eisenhardt Wright, Edward L. ;

To Reproduce

Parse the file /proj/ads/abstracts/sources/DataCite/doi/10.26131/irsa1.xml with the DataCite parser.

Additional context

It is not clear how often similar constructs will appear in other published datasets, but the parser should have logic to detect warning signs such as very long <creatorName> fields and likely name separators such as semicolons or pipes. This logic could be restricted to certain DOI prefixes (e.g. 10.26131, NASA IRSA) or made available to the parser generally.
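
As a starting point for discussion, here is a minimal sketch of what the splitting logic might look like. It is not taken from the ADSIngestParser code base: the function name `split_creator_name`, the constants `CANDIDATE_SEPARATORS` and `RESTRICTED_PREFIXES`, and the "every piece must contain a comma" guard are all assumptions made for illustration. The sketch covers the separator check and the optional DOI-prefix restriction; a length-based check for very long fields is left out.

```python
from typing import List, Optional, Sequence

# Hypothetical defaults for illustration; not part of ADSIngestParser.
CANDIDATE_SEPARATORS = (";", "|")
RESTRICTED_PREFIXES = ("10.26131",)  # NASA IRSA, as mentioned in this issue


def split_creator_name(
    raw: str,
    doi: Optional[str] = None,
    prefixes: Optional[Sequence[str]] = None,
    separators: Sequence[str] = CANDIDATE_SEPARATORS,
) -> List[str]:
    """Return the individual names found in one <creatorName> value.

    If `prefixes` is given, splitting is attempted only for DOIs under one
    of those prefixes; otherwise it is attempted for every record.
    """
    text = " ".join(raw.split())  # collapse newlines and repeated spaces
    if not text:
        return []

    if prefixes is not None:
        normalized = (doi or "").lower()
        if normalized.startswith("doi:"):
            normalized = normalized[4:]
        allowed = tuple(p.lower() + "/" for p in prefixes)
        if not normalized.startswith(allowed):
            return [text]

    for sep in separators:
        if sep not in text:
            continue
        parts = [p.strip() for p in text.split(sep)]
        parts = [p for p in parts if p]
        # Assumed guard: split only when every piece looks like
        # "Family, Given", so a stray separator inside a single name
        # or an organization name is left untouched.
        if len(parts) > 1 and all("," in p for p in parts):
            return parts

    return [text]


if __name__ == "__main__":
    sample = "Wright, Edward L.; Eisenhardt, Peter R. M.; Gautier, Thomas N., III"
    for name in split_creator_name(sample):
        print(name)
```

Passing `prefixes=RESTRICTED_PREFIXES` together with the record's DOI gives the restricted behavior; leaving `prefixes` as `None` applies the check to every record. Each returned string would then still need to go through whatever name parsing the DataCite parser already applies to a single <creatorName>.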