adsabs / ADSIngestParser

Curation parser library
MIT License

Pub month versus collection month #122

Open seasidesparrow opened 3 months ago

seasidesparrow commented 3 months ago

Describe the bug Records can contain more than one <pub-date publication-format="electronic"> tag. Springer/Nature records in particular can carry both <pub-date date-type="pub"> and <pub-date date-type="collection">: the first gives the publication date of the paper itself, the second the formal publication date of the collection (e.g. the issue) it belongs to. The current parser calls find_all("pub-date") and iterates over every date it finds, so an electronic date that has already been set is overwritten by any later electronic date.

To Reproduce See the file /proj/ads/abstracts/data/NATURE/npj/NPJ.052224/JOU=41467/VOL=2024.15/ISU=1/ART=48265/41467_2024_Article_48265_nlm.xml. This file contains two <pub-date publication-format="electronic"> tags: the first has date-type="pub" (May 2024) and the second has date-type="collection" (December 2024). The article itself was published in May 2024, but the parser records the electronic date as December 2024.

Additional context Using find_all is fine, but the parser needs to check date-type together with publication-format, preferring "pub" over "collection". The collection date can go into otherDateType.
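
A rough sketch of that selection rule, in case it helps the discussion. This is not the parser's actual code: it uses the standard library's xml.etree.ElementTree instead of the parser's find_all, the names select_electronic_dates and _date_parts are hypothetical, and the XML fragment is a made-up stand-in modelled on the file above (month and year only), not its real contents.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the report: two electronic dates,
# "pub" (May 2024) followed by "collection" (December 2024).
SAMPLE = """<article-meta>
  <pub-date date-type="pub" publication-format="electronic">
    <month>5</month><year>2024</year>
  </pub-date>
  <pub-date date-type="collection" publication-format="electronic">
    <month>12</month><year>2024</year>
  </pub-date>
</article-meta>"""


def _date_parts(node):
    """Return the year/month/day children that are present, as strings."""
    parts = {}
    for tag in ("year", "month", "day"):
        text = node.findtext(tag)
        if text and text.strip():
            parts[tag] = text.strip()
    return parts


def select_electronic_dates(xml_text, preferred=("pub", "collection")):
    """Pick one electronic date by date-type preference.

    Returns (electronic_date, other_dates). other_dates maps each
    remaining date-type to its date, as a stand-in for whatever ends up
    in otherDateType.
    """
    root = ET.fromstring(xml_text)
    by_type = {}
    for node in root.iter("pub-date"):
        if node.get("publication-format") != "electronic":
            continue
        date_type = node.get("date-type") or "unknown"
        # Keep the first date seen for each date-type; never overwrite.
        by_type.setdefault(date_type, _date_parts(node))

    # Prefer "pub", then "collection", then whatever came first.
    chosen = next((t for t in preferred if t in by_type), None)
    if chosen is None and by_type:
        chosen = next(iter(by_type))

    electronic = by_type.get(chosen)
    other = {t: d for t, d in by_type.items() if t != chosen}
    return electronic, other


electronic, other = select_electronic_dates(SAMPLE)
print(electronic)
print(other)
```

With this approach the result no longer depends on the order of the tags in the file: the "pub" date is chosen as the electronic date whether it appears before or after the "collection" date, and the collection date is kept separately rather than dropped.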