CederGroupHub / LimeSoup

LimeSoup is a package to parse HTML or XML papers from different publishers.
MIT License

[Springer] Paragraphs containing bullet points #33

Closed zhugeyicixin closed 5 years ago

zhugeyicixin commented 5 years ago

The "Conclusion" part in "10.1007/s40964-017-0023-1" has only 1 paragraph while the parsed result has 3. I think it is because bullet points are used in that paper. (see unit test LimeSoup/test/test_springer/test_springer.py) @IAmGrootel

hhaoyan commented 5 years ago

Wait, shouldn't we use 3 paragraphs? Do we consider bullet points as a single paragraph?

zhugeyicixin commented 5 years ago

I think there could be some cases where bullet points are separate paragraphs. But at least for this one, it looks more like a single paragraph, because the HTML contents are wrapped in the same DIV and the words "all the experimental results are summarized here:" indicate that the paragraph has not ended.

hhaoyan commented 5 years ago

You are partially right: this unordered list is enclosed in a div element with class "Para", so it's natural to think that these items belong to that paragraph. However, this definition of "paragraph" is problematic because, if you browse through the file, not all elements in the same div are actually part of a single paragraph. Let's not define paragraphs in this way. Usually, a paragraph is thought of as a continuous block of text representing some common ideas (correct me if I'm wrong). We probably care more about how they are arranged and displayed in the paper.

There is a formal definition of a paragraph in HTML; see https://developer.mozilla.org/en-US/docs/Web/HTML/Element/p. A paragraph is defined as a block-level element, meaning that it is displayed as a block. A paragraph can be made explicit by enclosing it in <p></p>. It can also be made implicit by using other block-level elements, such as <div>, <ul>, <form>, etc., to forcibly close a block element. A <p> element does not necessarily mean a real paragraph, because you can set display: inline on it to force it to become an inline element; see https://www.w3schools.com/cssref/tryit.asp?filename=trycss_display. As you can see, it's pretty hard, but not impossible, to decide what counts as a paragraph.

To keep it simple, let's not worry about rule breakers such as setting the display: inline property. But we do know list elements are rendered as block-level paragraphs. To keep it even simpler, in this case, since bullet points are rendered as separate blocks of text in a paper or an HTML file, I don't think these three blocks belong to the same paragraph anymore.
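
For illustration only, here is a minimal sketch of the rule described above: start a new paragraph at every block-level boundary and ignore CSS overrides such as display: inline. It is not LimeSoup's actual Springer parser. The names BLOCK_TAGS, BlockSplitter and split_paragraphs are made up for this example, the tag set is deliberately incomplete, and the sample markup is invented rather than taken from the paper.

```python
from html.parser import HTMLParser

# Assumed, non-exhaustive set of block-level tags (hypothetical; extend as needed).
BLOCK_TAGS = {"p", "div", "ul", "ol", "li", "form", "section", "blockquote"}


class BlockSplitter(HTMLParser):
    """Collect text into paragraphs, cutting at every block-level boundary."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = []

    def _flush(self):
        # Close the current paragraph if it holds any non-whitespace text.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Opening a block element ends whatever paragraph came before it.
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        # Closing a block element ends the paragraph inside it.
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Inline elements (em, span, ...) do not split; their text is appended.
        self._buffer.append(data)


def split_paragraphs(html):
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep any trailing text that follows the last block tag
    return parser.paragraphs


# Invented sample: a list nested inside a div with class "Para".
sample = (
    '<div class="Para">Results are summarized here:'
    "<ul><li>First <em>finding</em>.</li><li>Second finding.</li></ul>"
    "</div>"
)
print(split_paragraphs(sample))
```

With this rule, the introductory sentence and each list item come out as separate paragraphs even though they share one div, while the inline em element does not cause a split.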

hhaoyan commented 5 years ago

To clarify, see this funny fiddle I created: https://jsfiddle.net/67xrtqvc/

zhugeyicixin commented 5 years ago

Yep, I think this definition is more universal, and I have changed the test code.

BTW, the fiddle is really convenient!

zhugeyicixin commented 5 years ago

So the original parser is correct; I will close this issue.