adsabs / ADSIngestParser

Curation parser library
MIT License

Author name parser is failing in some cases, possibly unicode-related (ccaron) #25

Closed: seasidesparrow closed this issue 2 years ago

seasidesparrow commented 2 years ago

Describe the bug: Parsing certain files produces an ingest data model in which one or more authors have a given name but no surname. In all three cases I checked, the author surname contained the entity "ccaron" (č).

To Reproduce: Parse any one of the following files:
Crossref2/doi/10.1002/./as/na/,1/92/52/25/06/02//metadata.xml
Crossref2/doi/10.1002/./as/na/,2/00/31/01/51//metadata.xml
Crossref2/doi/10.1002/./as/na/,1/93/92/68/11/05//metadata.xml


seasidesparrow commented 2 years ago

The issue occurs because processing behaves differently depending on whether the input is read as binary data or as text. If a file is loaded in binary mode, with something like with open(infile, 'rb') as finput: data = finput.read(), the unicode data causes some exception that fails silently, affecting just those fields. Reading the file as text, with open(infile, 'r'), results in correctly interpreted data.

The immediate fix is to read input files strictly as text, and no change is required to IngestParser itself.
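
For illustration, here is a minimal sketch of the difference between the two read modes. It does not call IngestParser at all; the helper names read_as_text and read_as_bytes, the sample record, and the choice of UTF-8 as the file encoding are assumptions made for this example, not details taken from the Crossref files above.

```python
import os
import tempfile


def read_as_text(path, encoding="utf-8"):
    """Read a metadata file in text mode, decoding it to str.

    The explicit encoding is an assumption; without it, open() falls
    back to the platform default, which may not be UTF-8.
    """
    with open(path, "r", encoding=encoding) as finput:
        return finput.read()


def read_as_bytes(path):
    """Read the same file in binary mode, returning undecoded bytes."""
    with open(path, "rb") as finput:
        return finput.read()


# Hypothetical sample record containing a c-caron; not real Crossref data.
sample = "<surname>Kovač</surname>"

with tempfile.TemporaryDirectory() as tmpdir:
    infile = os.path.join(tmpdir, "metadata.xml")
    with open(infile, "w", encoding="utf-8") as fout:
        fout.write(sample)

    text_data = read_as_text(infile)
    raw_data = read_as_bytes(infile)

# Text mode yields str, where the c-caron is a single character.
print(type(text_data).__name__, "č" in text_data)
# Binary mode yields bytes, where the c-caron is the two-byte
# UTF-8 sequence 0xC4 0x8D.
print(type(raw_data).__name__, b"\xc4\x8d" in raw_data)
```

If a caller can only obtain bytes, decoding them explicitly (for example raw_data.decode("utf-8")) before handing them on should give the same string as the text-mode read, again assuming the file really is UTF-8.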