adsabs / ADSIngestParser

Curation parser library
MIT License

jats.py: Author names fielded with "string-name" are not being captured #29

Closed: seasidesparrow closed this issue 1 year ago

seasidesparrow commented 1 year ago

Describe the bug: JATS or JATS-like (e.g. NLM) records whose valid contributors are given in a "string-name" element (rather than "name") are not being parsed completely.

To Reproduce: Parse the NLM file '/proj/ads/abstracts/data/T+F/2022/TF.101322/ymte20.v037.i11/10667857.2021.2016293.xml' using a JATSParser object, passing bsparser="lxml" to .parse. The author names will be missing from the resulting datamodel object, even though affiliations may be present.

Additional context: Both JATS and its predecessor, NLM 3, allow the author name in a contrib element to be fielded with the subelement 'string-name' as well as 'name'. The 'name' case is covered in parsers/jats.py beginning at line 220: https://github.com/adsabs/ADSIngestParser/blob/92e4b7baa74c418376893616d32d53a2ef77e38d/adsingestp/parsers/jats.py#L220 The 'string-name' form is less common than 'name', but for this publisher (Taylor & Francis) it appears to be standard in their NLM-formatted content; none of the records tested from the subdirectory had author names in the output.
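
For illustration, here is a minimal sketch of the fallback described above: look for 'name' inside each contrib and, if it is absent, try 'string-name'. This is not the code in parsers/jats.py. It uses the standard library's xml.etree.ElementTree instead of the parser's own machinery so that it runs on its own, and the sample XML and the function name extract_author_names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one author fielded with <name>, one with <string-name>.
SAMPLE = """<contrib-group>
  <contrib contrib-type="author">
    <name><surname>Doe</surname><given-names>Jane</given-names></name>
  </contrib>
  <contrib contrib-type="author">
    <string-name><given-names>John</given-names> <surname>Smith</surname></string-name>
  </contrib>
</contrib-group>"""


def extract_author_names(xml_text):
    """Return surname/given-names for each contrib, accepting either name element."""
    root = ET.fromstring(xml_text)
    authors = []
    for contrib in root.iter("contrib"):
        # Prefer <name>; fall back to <string-name> when it is missing.
        # Compare against None explicitly: an Element with no children is falsy.
        name_el = contrib.find("name")
        if name_el is None:
            name_el = contrib.find("string-name")
        if name_el is None:
            continue  # no author name fielded in this contrib
        authors.append(
            {
                "surname": (name_el.findtext("surname") or "").strip(),
                "given": (name_el.findtext("given-names") or "").strip(),
            }
        )
    return authors


if __name__ == "__main__":
    for author in extract_author_names(SAMPLE):
        print(author)
```

A 'string-name' element may also hold the name as unfielded text with no 'surname' or 'given-names' children; this sketch returns empty strings in that case rather than trying to split the text.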

seasidesparrow commented 1 year ago

Fixed by #39