jemrobinson closed 5 years ago
I've added some tests which fail if the ordering constraints are violated. The paragraph breaking code calls consolidate in between the identify and split stages, so that the newly-added break indicators can be included in the correct string.
Reviewed in person with @martintoreilly. The only outstanding question is which of these is the correct behaviour:
1. <div>text</div> => <div>text</div>
2. <div>text</div> => <div><p>text</p></div>
I think it is (1) and this is what the code currently does, but should be confirmed.
Confirmed with @martintoreilly that (1) is correct. See the comment in plain_html.py:
We do this to ensure that there is a strong, unique correspondence between presentational paragraphs and DOM structure:
- all presentational paragraphs should be the only content associated with their immediate parent
- all presentational paragraphs at the same conceptual level should be equally nested
- the string as displayed in the browser should be equivalent to the innerHTML of the parent (so that indexing is equivalent between presentation and source)
The following examples should not be allowed:
1. Two presentational elements at the same DOM level have non-equivalent index levels
<div index="1.1">
text
<p index="1.1.1">more text</p>
</div>
2. Index 1.1 might contain both strings
<div index="1.1">
<p index="1.1.1">more text</p>
text
</div>
3. Two presentational paragraphs are included in the same index
<div index="1.1">
text
<p index="1.1.1">more text</p>
yet more text
</div>
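All three disallowed examples share one property: an element holds direct text alongside child elements. A minimal sketch of a check for that property follows, using only the standard library. The function name has_mixed_content is hypothetical and is not taken from plain_html.py, and the snippets are the examples above collapsed onto one line.

```python
import xml.etree.ElementTree as ET


def has_mixed_content(element):
    """Return True if any element holds both direct text and child elements.

    Hypothetical helper for illustration only; not part of plain_html.py.
    """
    for node in element.iter():
        children = list(node)
        if not children:
            continue
        # Direct text is the text before the first child plus each child's tail.
        direct_text = [node.text] + [child.tail for child in children]
        if any(text and text.strip() for text in direct_text):
            return True
    return False


# The three disallowed examples, followed by an allowed one.
disallowed = [
    '<div index="1.1">text<p index="1.1.1">more text</p></div>',
    '<div index="1.1"><p index="1.1.1">more text</p>text</div>',
    '<div index="1.1">text<p index="1.1.1">more text</p>yet more text</div>',
]
allowed = '<div index="1.1">text</div>'

for snippet in disallowed:
    assert has_mixed_content(ET.fromstring(snippet))
assert not has_mixed_content(ET.fromstring(allowed))
```

A check like this only covers the "only content of the immediate parent" rule. It does not verify that paragraphs at the same conceptual level are equally nested.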
Rewrite of the plain_html logic. The sequence is now as follows:
- ... <br> and <hr> elements with paragraph breaks, wrapping in <p> tags where needed

The main differences are:
- ... <br> and <hr> elements into one function (removing fragile function ordering problems)
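As a rough illustration of handling <br> and <hr> in a single function, the sketch below splits an element's text at its break children and wraps each run in a <p> tag. It is an assumption-laden toy built on xml.etree.ElementTree, not the code in plain_html.py. The name split_on_breaks is hypothetical, and it only handles an element whose children are all breaks.

```python
import xml.etree.ElementTree as ET

BREAK_TAGS = {"br", "hr"}


def split_on_breaks(element):
    """Replace direct <br>/<hr> children with <p>-wrapped text runs, in place.

    Hypothetical helper for illustration only. An element with no break
    children is returned untouched, matching behaviour (1) discussed above.
    """
    children = list(element)
    if not children or any(child.tag not in BREAK_TAGS for child in children):
        return element
    # Collect the text runs that the break elements separate.
    runs = [element.text] + [child.tail for child in children]
    for child in children:
        element.remove(child)
    element.text = None
    # Wrap each non-empty run in its own paragraph.
    for run in runs:
        if run and run.strip():
            paragraph = ET.SubElement(element, "p")
            paragraph.text = run.strip()
    return element


def render(source):
    """Parse a snippet, apply split_on_breaks, and serialise the result."""
    return ET.tostring(split_on_breaks(ET.fromstring(source)), encoding="unicode")


print(render("<div>one<br/>two<hr/>three</div>"))
print(render("<div>text</div>"))
```

Treating both tags in one pass means there is no second function whose position in the call order could matter, which is the fragility the rewrite set out to remove.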