adsabs / ADSIngestParser

Curation parser library
MIT License

JATS affiliation parsing is poor in certain cases #54

Closed seasidesparrow closed 10 months ago

seasidesparrow commented 1 year ago

Describe the bug: Publishers may supply affiliations that contain multiple institution tags with different attributes (e.g. content-type="org-division" and content-type="org-name"). With the current JATS parser (v0.9.6), the affiliation data are decomposed, which strips the embedded data of their context, and the result may be output with poor formatting (e.g. missing spaces between elements).

To Reproduce: Use the ingest parser to parse abstracts/sources/SPRINGER/files/JOU=41467/VOL=2023.14/ISU=1/ART=40261/41467_2023_Article_40261_nlm.xml. Parsing produces the following affiliation string:

Clem Jones Centre for Ageing Dementia Research, Queensland Brain InstituteThe University of Queensland4072BrisbaneQLDAustralia

Additional context: the following example is from the file noted above.

<aff id="Aff1"><label>1</label><institution-wrap><institution-id institution-id-type="GRID">grid.1003.2</institution-id><institution-id institution-id-type="ISNI">0000 0000 9320 7537</institution-id><institution content-type="org-division">Clem Jones Centre for Ageing Dementia Research, Queensland Brain Institute</institution><institution content-type="org-name">The University of Queensland</institution></institution-wrap><addr-line content-type="postcode">4072</addr-line><addr-line content-type="city">Brisbane</addr-line><addr-line content-type="state">QLD</addr-line><country country="AU">Australia</country></aff>
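
For illustration, here is a minimal sketch of the formatting problem and one possible way around it. It is not the ADSIngestParser implementation: it uses only the standard-library xml.etree.ElementTree, and the names format_affiliation, text_parts, and SKIP_TAGS are hypothetical. It assumes that the label and institution-id elements should be left out of the affiliation string and that the remaining fragments should be kept in document order. Joining the fragments with an empty separator reproduces the run-together string reported above; joining them with ", " keeps the elements apart.

```python
import xml.etree.ElementTree as ET

# The <aff> element from the example above, split across lines for readability.
AFF_XML = (
    '<aff id="Aff1"><label>1</label><institution-wrap>'
    '<institution-id institution-id-type="GRID">grid.1003.2</institution-id>'
    '<institution-id institution-id-type="ISNI">0000 0000 9320 7537</institution-id>'
    '<institution content-type="org-division">Clem Jones Centre for Ageing '
    'Dementia Research, Queensland Brain Institute</institution>'
    '<institution content-type="org-name">The University of Queensland</institution>'
    '</institution-wrap>'
    '<addr-line content-type="postcode">4072</addr-line>'
    '<addr-line content-type="city">Brisbane</addr-line>'
    '<addr-line content-type="state">QLD</addr-line>'
    '<country country="AU">Australia</country></aff>'
)

# Assumption: these elements carry identifiers, not affiliation text.
SKIP_TAGS = {"label", "institution-id"}


def text_parts(elem):
    """Yield non-empty text fragments in document order, skipping SKIP_TAGS."""
    if elem.tag in SKIP_TAGS:
        return
    if elem.text and elem.text.strip():
        yield elem.text.strip()
    for child in elem:
        yield from text_parts(child)
        # Tail text belongs to the parent, so keep it even for skipped children.
        if child.tail and child.tail.strip():
            yield child.tail.strip()


def format_affiliation(aff_xml, sep=", "):
    """Join the text fragments of one <aff> element with the given separator."""
    return sep.join(text_parts(ET.fromstring(aff_xml)))


if __name__ == "__main__":
    print(format_affiliation(AFF_XML, sep=""))  # fragments run together
    print(format_affiliation(AFF_XML))          # fragments separated by ", "
```

This sketch keeps the fragments in document order, so the postcode still precedes the city. Reordering by content-type, or keeping the org-division and org-name labels as context, would be a separate design decision.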
seasidesparrow commented 1 year ago

See https://jats.nlm.nih.gov/publishing/tag-library/1.3/element/aff.html for documentation

seasidesparrow commented 10 months ago

This would be a good issue for Mugdha, I think.