adsabs / ADSIngestParser

Curation parser library
MIT License

Name normalization should include .strip(), or at least .lstrip() #71

Open seasidesparrow opened 10 months ago

seasidesparrow commented 10 months ago

Describe the bug

There is at least one instance of publisher-supplied metadata in which a surname has extraneous leading whitespace. This interferes downstream with the current ADSIngestEnrichment bibcode generator, which takes the left-most character of the name as the author initial (name[0]), so a leading space yields a blank initial. The issue will be fixed in the enrichment package, but this is a data normalization problem that should be handled at parse time.

To Reproduce

From the 2023-10-21 data delivery from IOP, take the file 2053-1591_10_10_105303/mrx_10_10_105303.xml, in which the first author's name is fielded as <surname> M</surname> (note the preceding space). Parse the file with JATSParser().parse. The resulting JSON will include:

  "authors": [
    {
      "name": {
        "surname": " M",
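
A minimal sketch of the normalization being requested, not the parser's actual code. The helper `normalize_name` is hypothetical, and the input dict is hand-built to mirror the fragment above rather than produced by `JATSParser`. It strips leading and trailing whitespace from every string value in a name dict, which is enough to make `name[0]` return the real initial.

```python
def normalize_name(name: dict) -> dict:
    """Return a copy of a name dict with whitespace stripped from string values.

    Hypothetical helper: non-string values (e.g. None) are passed through as-is.
    """
    return {
        key: value.strip() if isinstance(value, str) else value
        for key, value in name.items()
    }


# Hand-built record mirroring the parsed output shown above (assumption).
raw = {"name": {"surname": " M"}}
clean = {"name": normalize_name(raw["name"])}

print(repr(raw["name"]["surname"][0]))    # ' '
print(repr(clean["name"]["surname"][0]))  # 'M'
```

Using `.strip()` rather than `.lstrip()` also removes trailing whitespace, which costs nothing and guards against the mirror-image case.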
