Closed: seasidesparrow closed this issue 2 years ago
The issue occurs because processing works differently depending on whether the input data are read in as binary data or as text data. If the files are loaded in binary mode, with something like
with open(infile, 'rb') as finput: data = finput.read()
the Unicode data causes some exception, and parsing fails silently on just those fields. Reading the file as text data, with open(infile, 'r'), results in correctly interpreted data.
The immediate fix is to read input files strictly as text, and no change is required to IngestParser itself.
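As a minimal sketch of that fix, the helper below always opens the input in text mode, so the caller receives a str rather than bytes. The function name read_input_as_text and the UTF-8 default are assumptions made for illustration; they are not part of IngestParser, and the actual encoding of the Crossref files has not been confirmed here.

```python
def read_input_as_text(infile, encoding="utf-8"):
    """Return the contents of infile as a str, never as bytes.

    Hypothetical helper: the name and the UTF-8 default are assumptions,
    not part of IngestParser.
    """
    # Text mode ('r') decodes the bytes on read, so characters such as
    # 'č' arrive as ordinary Unicode text.
    with open(infile, "r", encoding=encoding) as finput:
        return finput.read()
```

Passing the encoding explicitly avoids relying on the platform default, which can differ between machines.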
Describe the bug
Parsing certain files results in an ingest data model whose author(s) may have a given name but no surname. The three cases I checked all had the entity "ccaron" (č) in the author surname.
To Reproduce
Parse the file
Crossref2/doi/10.1002/./as/na/,1/92/52/25/06/02//metadata.xml
or
Crossref2/doi/10.1002/./as/na/,2/00/31/01/51//metadata.xml
or
Crossref2/doi/10.1002/./as/na/,1/93/92/68/11/05//metadata.xml