Describe the bug
At least one important DataCite source, NASA IRSA, publishes records in which all authors of a dataset are packed into a single <creatorName> element, with individual names separated by semicolons. For example, the DataCite record for doi:10.26131/IRSA1 has the following creators structure:
<creators>
<creator>
<creatorName>Wright, Edward L.; Eisenhardt, Peter R. M.; Mainzer, Amy K.; Ressler, Michael E.; Cutri, Roc M.; Jarrett, Thomas; Kirkpatrick, J. Davy; Padgett, Deborah; McMillan, Robert S.; Skrutskie, Michael; Stanford, S. A.; Cohen, Martin; Walker, Russell G.; Mather, John C.; Leisawitz, David; Gautier, Thomas N., III; McLean, Ian; Benford, Dominic; Lonsdale, Carol J.; Blain, Andrew; Mendez, Bryan; Irace, William R.; Duval, Valerie; Liu, Fengchuan; Royer, Don; Heinrichsen, Ingolf; Howard, Joan; Shannon, Mark; Kendall, Martha; Walsh, Amy L.; Larsen, Mark; Cardon, Joel G.; Schick, Scott; Schwalm, Mark; Abid, Mohamed; Fabinsky, Beth; Naes, Larry; Tsai, ChaoWei</creatorName>
</creator>
</creators>
Parsing this record with the current DataCite (DC) parser (v0.9.20) and passing the result to the Manual Parser's classic tagger produces a single, garbled author instead of one author per name:
%A Eisenhardt Wright, Edward L. ;
To Reproduce
Parse the file /proj/ads/abstracts/sources/DataCite/doi/10.26131/irsa1.xml with the DataCite parser.
Additional context
It's not clear how often we will see similar constructs in other published datasets, but we should have logic in place that detects unusually long <creatorName> values and likely name separators such as semicolons or pipes. We could either restrict this logic to certain DOI prefixes (e.g. 10.26131, NASA IRSA) or make it available to the parser generally.
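
A minimal sketch of what that detection could look like, to make the proposal concrete. This is not the DataCite parser's actual code: the function name split_creator_name, the MAX_SINGLE_NAME_LENGTH threshold, and the prefixes argument are all hypothetical, and the heuristic (split only when the value is very long or every piece looks like "Last, First") is an assumption that would need checking against real records.

```python
import re

# Hypothetical values for illustration; tune against real DataCite records.
MAX_SINGLE_NAME_LENGTH = 80
SEPARATOR_PATTERN = re.compile(r"\s*[;|]\s*")


def split_creator_name(creator_name, doi=None, prefixes=None):
    """Return the individual names found in one <creatorName> value.

    If `prefixes` is given, splitting is only attempted for DOIs that
    start with one of those prefixes; otherwise it applies to any record.
    """
    text = creator_name.strip()

    # Optional restriction to specific DOI prefixes (e.g. "10.26131").
    if prefixes is not None:
        if doi is None or not doi.lower().startswith(tuple(prefixes)):
            return [text]

    # Split on semicolons or pipes and drop empty pieces.
    parts = [p for p in SEPARATOR_PATTERN.split(text) if p]
    if len(parts) < 2:
        return [text]

    # Only treat the separators as name delimiters when the value is
    # suspiciously long or every piece looks like "Last, First".
    if len(text) > MAX_SINGLE_NAME_LENGTH or all("," in p for p in parts):
        return parts
    return [text]


if __name__ == "__main__":
    sample = "Wright, Edward L.; Eisenhardt, Peter R. M.; Gautier, Thomas N., III"
    for name in split_creator_name(sample, doi="10.26131/IRSA1",
                                   prefixes=("10.26131",)):
        print(name)
```

Each returned name could then be emitted as its own creator before the record reaches the tagger. Passing prefixes=None would apply the same check to every DataCite record rather than only NASA IRSA.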