CCSDForge / episciences

An overlay journal platform
https://www.episciences.org/
GNU General Public License v3.0

[Bug Report] Abstract ignores line breaks #600

Open Gru-gru opened 2 months ago

Gru-gru commented 2 months ago

Describe the bug

When a paper is imported from arXiv, the abstract displayed by Episciences can differ from the one on arXiv, because line breaks are lost. Concrete example: https://theoretics.episciences.org/14397 vs https://arxiv.org/abs/2311.10204

Expected behavior

Line breaks should not be ignored, so that the abstract is shown as the authors intended.

rtournoy commented 2 months ago

Yes indeed, thank you for reporting this. I'm adding notes to the bug report; we need to explore a few options to fix this.

This is what arXiv provides on the web (image): a few line breaks, rendered as <br>.

This is what arXiv provides on the API (image): lines seem to be wrapped at 80 characters max, which introduces unwanted line breaks.

Source: http://export.arxiv.org/oai2?verb=GetRecord&identifier=oai:arXiv.org:2311.10204&metadataPrefix=arXivRaw

And this is why we are not merely replacing line feeds \n with HTML line breaks <br>: every wrapped line would also end in a <br>.

The result would look like this: (image)
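To make the wrapping problem described above concrete, here is a minimal Python sketch, not the Episciences implementation. It compares the naive `\n` to `<br>` replacement with one possible heuristic: if the first word of the next line would still have fit within the wrap width, the break was probably typed by the author; otherwise it was probably introduced by wrapping. The function names, the 80-character width, and the assumption that the wrapping is greedy are all hypothetical and would need checking against real records.

```python
import html

# Assumption: width taken from the observation in this thread, not from arXiv documentation.
WRAP_WIDTH = 80


def naive_br(abstract: str) -> str:
    """Replace every line feed with <br>; wrap-induced breaks are kept too."""
    return "<br>".join(html.escape(line) for line in abstract.split("\n"))


def unwrap_abstract(abstract: str, width: int = WRAP_WIDTH) -> str:
    """Join lines that were probably broken by wrapping; keep other breaks as <br>."""
    lines = [line.strip() for line in abstract.strip().split("\n")]
    out = []
    for i, line in enumerate(lines):
        out.append(html.escape(line))
        if i == len(lines) - 1:
            break
        next_line = lines[i + 1]
        first_word = next_line.split(" ", 1)[0]
        # If the next word would still have fit on this line, a greedy wrapper
        # would not have broken here, so the break is treated as intentional.
        fits = len(line) + 1 + len(first_word) <= width
        if not line or not next_line or fits:
            out.append("<br>")
        else:
            out.append(" ")
    return "".join(out)
```

The heuristic can misfire, for example when an author's own line happens to end close to the wrap width, so it would be a fallback rather than a replacement for a source that preserves the original breaks.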

This is what the DataCite API provides: (image)

curl -s https://api.datacite.org/dois/10.48550/arXiv.2311.10204 |jq|grep '"description"' |grep --color '\\\n'
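The same check can be scripted. The sketch below is only an illustration: it assumes the JSON returned by the DataCite REST API has already been downloaded and parsed, and that the descriptions sit under `data.attributes.descriptions`. The helper name and the sample payload are made up for the example and are not the real response for this DOI.

```python
import json


def description_line_feeds(payload: dict) -> list[tuple[str, bool]]:
    """Return (descriptionType, contains_line_feed) for each description entry."""
    attributes = payload.get("data", {}).get("attributes", {})
    return [
        (d.get("descriptionType", ""), "\n" in d.get("description", ""))
        for d in attributes.get("descriptions", [])
    ]


# Hypothetical payload, shaped like a DataCite DOI record but with invented text.
sample_payload = json.loads(
    '{"data": {"attributes": {"descriptions": ['
    '{"description": "First line.\\nSecond line.", "descriptionType": "Abstract"},'
    '{"description": "No breaks here.", "descriptionType": "Other"}'
    "]}}}"
)

for description_type, has_line_feed in description_line_feeds(sample_payload):
    print(description_type, has_line_feed)
```

Which of these line feeds are intentional and which come from wrapping still has to be decided separately.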

We can try to:

Let's ignore HTML scraping.