adsabs / ADSIngestParser

Curation parser library
MIT License

Abstracts in jats files with multiple embedded <p> tags are dropping all subsequent to the first #50

Closed: seasidesparrow closed this issue 1 year ago

seasidesparrow commented 1 year ago

Describe the bug: If a JATS abstract contains multiple sections, each wrapped in its own paragraph (<p>) tag, the JATS parser captures only the first of them.

To Reproduce: Astronomy and Astrophysics abstracts may have multiple paragraph tags for "Context", "Aims", "Methods", and "Results". Try parsing abstracts/data/A+A/A+A670/abstracts/aa42959-21.xml

The abstract returned will be

  "abstract": {
    "textEnglish": "Context. V838 Monocerotis is a peculiar binary that underwent an immense stellar explosion in 2002, leaving behind an expanding cool supergiant and a hot B3V companion. Five years after the outburst, the B3V companion disappeared from view, and has not returned to its original state."
  },

and is missing the "Aims", "Methods", and "Results" sections.

Additional context: Line 551 of parsers.jats is where the abstract is extracted. It uses find('p') for the paragraph tag, which returns only the first match, instead of iterating over the result of find_all('p').
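
A minimal sketch of the difference, for illustration only. It is not the code in parsers.jats: it uses the standard library's xml.etree.ElementTree (whose find/findall behave analogously to the find/find_all calls mentioned above) so that it runs without extra dependencies. The sample XML, its placeholder text, and the helper names are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical JATS-like fragment with placeholder text (not the real abstract).
SAMPLE = """<abstract>
  <p><bold>Context.</bold> First section.</p>
  <p><bold>Aims.</bold> Second section.</p>
  <p><bold>Methods.</bold> Third section.</p>
</abstract>"""


def paragraph_text(p):
    """Flatten one <p> element, including inline markup, to normalized text."""
    return " ".join("".join(p.itertext()).split())


def first_paragraph_only(abstract):
    """Mirrors the reported behavior: find() returns only the first <p>."""
    p = abstract.find("p")
    return paragraph_text(p) if p is not None else ""


def all_paragraphs(abstract):
    """Suggested direction: iterate over every <p> child and join the text."""
    return " ".join(paragraph_text(p) for p in abstract.findall("p"))


abstract = ET.fromstring(SAMPLE)
truncated = first_paragraph_only(abstract)
complete = all_paragraphs(abstract)
print(truncated)
print(complete)
```

With the sample above, the first helper yields only the "Context" paragraph, while the second also includes "Aims" and "Methods". How the sections should be joined (a space, a newline, or preserved markup) is a design choice left to the parser.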