adsabs / ADSIngestParser

Curation parser library
MIT License

Elsevier parser is unable to parse records with doctype `<ja:simple-article>` #92

Closed: seasidesparrow closed this issue 5 months ago

seasidesparrow commented 6 months ago

Describe the bug

The `ElsevierParser.parse` function creates a BeautifulSoup object, `self.record_meta`, by locating the `<ja:article>` tag only (L358). There are apparently a small number of incoming metadata files from the publisher that use the tag `<ja:simple-article>` instead; these correspond to, e.g., dissertation abstracts and letters/comments. For such files `self.record_meta` is `None`, so `parse` raises an exception at the first call to `self.record_meta.find([element])` that does not first check that `self.record_meta` is a soup tree rather than `None`.
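The failure mode described above can be illustrated with a minimal sketch. It is not the code from ElsevierParser or the fix in #93: the helper name `find_record_meta`, the `ARTICLE_TAGS` list, the sample XML, and the choice of the `html.parser` backend are all assumptions made for illustration. It shows one possible approach, which is to pass both tag names to `find` and to check for `None` before any further lookup.

```python
from bs4 import BeautifulSoup

# Hypothetical list of accepted root tags; only these two are named in the issue.
ARTICLE_TAGS = ["ja:article", "ja:simple-article"]


def find_record_meta(xml_text):
    """Return the article root tag, or raise a clear error if none is present.

    Hypothetical helper for illustration, not part of ADSIngestParser.
    """
    # html.parser is assumed here so the sketch needs no extra backend;
    # the real parser may use a different one.
    soup = BeautifulSoup(xml_text, "html.parser")
    # find() accepts a list of names and returns None when nothing matches.
    record_meta = soup.find(ARTICLE_TAGS)
    if record_meta is None:
        raise ValueError("no <ja:article> or <ja:simple-article> element found")
    return record_meta


# Made-up sample record using the tag that the issue says was not handled.
simple = "<ja:simple-article><ce:title>A letter</ce:title></ja:simple-article>"
meta = find_record_meta(simple)
title = meta.find("ce:title")
```

Raising a named error at the point where the root tag is located keeps the failure in one place, instead of surfacing later as an `AttributeError` on `None` wherever `find` is first called.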

To Reproduce

Steps to reproduce the behavior: attempt to parse any of the following files in /proj/ads/abstracts/ingest/ADSManualParser/ELS.test/:

- S0024493724000495.xml
- S1571064523001112.xml
- S157106452400006X.xml
- S1571064524000083.xml
- S1571064524000095.xml

seasidesparrow commented 5 months ago

Fixed in #93