adsabs / ADSIngestParser

Curation parser library
MIT License

Elsevier: loss of formatting in abstract #57

Open csgrant00 opened 1 year ago

csgrant00 commented 1 year ago

/proj/ads/abstracts/data/ELS/CONSYN.AST/2023/ELS.080423/2214-5524/S2214552423X00030/S2214552423000275/S2214552423000275.xml

seasidesparrow commented 1 year ago

The bulleted points at the end of the abstract are marked with &bull;, but without a preceding <p> or <br>. The Elsevier XML has these elements embedded in a <ce:list>/<ce:list-item> tree, so there are no explicit line breaks; instead, each list item implies a paragraph break for its bullet.

Are paragraph tags allowed in abstracts, @csgrant00?

csgrant00 commented 1 year ago

I think so, at least I think they should be. I'll try to check...

seasidesparrow commented 1 year ago

This can be addressed by updating this line of code to join the two pieces of text with a "\n" rather than a single space: https://github.com/adsabs/ADSIngestParser/blob/33ae877f0dc86162182f363f218927f620bdf75b/adsingestp/parsers/elsevier.py#L161
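
A minimal sketch of the idea, not the actual code at the linked line: it uses the standard-library `xml.etree.ElementTree` rather than whatever the parser uses internally, and the namespace URI, the sample markup, and the `join_list_items` helper are all assumptions made for illustration. It only shows how the choice of separator decides whether each bullet ends up on its own line.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "ce" prefix; illustration only.
CE = "http://www.elsevier.com/xml/common/dtd"

# Made-up sample resembling a <ce:list>/<ce:list-item> tree.
SAMPLE = f"""
<ce:abstract-sec xmlns:ce="{CE}">
  <ce:list>
    <ce:list-item><ce:label>&#8226;</ce:label><ce:para>First highlight.</ce:para></ce:list-item>
    <ce:list-item><ce:label>&#8226;</ce:label><ce:para>Second highlight.</ce:para></ce:list-item>
  </ce:list>
</ce:abstract-sec>
""".strip()


def join_list_items(xml_text, sep="\n"):
    """Hypothetical helper: flatten each list item and join items with `sep`."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter(f"{{{CE}}}list-item"):
        # Put the bullet label and its text on one line, normalizing whitespace.
        text = " ".join(" ".join(item.itertext()).split())
        items.append(text)
    return sep.join(items)


if __name__ == "__main__":
    print(join_list_items(SAMPLE))           # one bullet per line
    print(join_list_items(SAMPLE, sep=" "))  # bullets run together
```

With `sep=" "` the bullets run together in a single line, which matches the loss of formatting reported here; with `sep="\n"` each bullet keeps its own line.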

seasidesparrow commented 1 year ago

This will also require either removing the call to self._clean_output at the end of this block, or making an upstream change to the base parser here: https://github.com/adsabs/ADSIngestParser/blob/33ae877f0dc86162182f363f218927f620bdf75b/adsingestp/parsers/base.py#L43
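
To illustrate why a cleanup step matters here, the sketch below assumes (without this being confirmed against base.py) that the cleanup collapses every run of whitespace into a single space. Both functions are hypothetical stand-ins, not the library's code; the second shows one way a cleanup could normalize spacing while keeping line breaks.

```python
import re


def clean_collapsing(text):
    """Hypothetical cleanup: collapses ALL whitespace, newlines included."""
    return re.sub(r"\s+", " ", text).strip()


def clean_preserving_newlines(text):
    """Hypothetical alternative: tidy each line but keep the line breaks."""
    # Collapse spaces and tabs within each line, then drop empty lines.
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    joined = "\u2022 First  highlight.\n\u2022 Second highlight."
    print(repr(clean_collapsing(joined)))
    print(repr(clean_preserving_newlines(joined)))
```

Under that assumption, a newline inserted between list items would be undone by the first style of cleanup, which is why the join change alone would not be enough.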