New method of whitespace joining

jemrobinson commented 5 years ago

Rewrite of the plain_html logic. The sequence is now as follows:

1. Insert space into non-spaced comments so that html5lib can interpret them correctly
2. Convert the HTML into a Soup parse tree
3. Remove comments and DOCTYPE strings
4. Process CDATA (currently we remove it)
5. Strip tag attributes apart from 'class' and 'style'
6. Remove blacklisted elements
7. Unwrap elements where we want to keep the text but drop the containing tag
8. Process elements with special innerText handling
9. Process unknown elements
10. Consolidate text, joining any consecutive NavigableStrings together
11. Remove empty strings and elements
12. Replace <br> and <hr> elements with paragraph breaks, wrapping in <p> tags where needed
13. Normalise all strings, removing whitespace and fixing unicode issues
14. Wrap any remaining bare text in a suitable block level element
15. Recursively replace any elements which have no children or only zero-length children

The main differences are:

- Move the string consolidation before whitespace normalisation (which closes #45 and closes #46).
- Combine the search and replacement of <br> and <hr> elements into one function (removing fragile function ordering problems).
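
To illustrate why the order of consolidation and normalisation can matter, here is a minimal sketch in plain Python. It is not the ReadabiliPy implementation: `normalise` and `consolidate` are hypothetical stand-ins, text nodes are modelled as plain strings rather than NavigableStrings, and the example is only one plausible failure mode, not necessarily the one reported in #45 or #46.

```python
import re


def normalise(text):
    """Collapse runs of whitespace to a single space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def consolidate(children):
    """Merge consecutive strings; non-string items (stand-ins for tags) are kept."""
    merged = []
    for child in children:
        if isinstance(child, str) and merged and isinstance(merged[-1], str):
            merged[-1] += child
        else:
            merged.append(child)
    return merged


# Adjacent text fragments, as might be left behind after unwrapping an inline tag.
fragments = ["Hello ", "world", "  again "]

# Normalising each fragment first strips the spaces that separated the words.
normalise_first = "".join(normalise(s) for s in fragments)

# Consolidating first keeps the word boundaries, then normalisation tidies them.
consolidate_first = [normalise(s) for s in consolidate(fragments)]
```

In this sketch, normalising first glues the words together, while consolidating first yields a single string with the word boundaries preserved.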

jemrobinson commented 5 years ago

I've added some tests which fail if the ordering constraints are violated. The paragraph breaking code calls consolidate in between the identify and split stages, so that the newly-added break indicators can be included in the correct string.
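
The identify, consolidate, split sequence described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `Tag`, `BREAK`, `identify`, `consolidate` and `split` are hypothetical names, and the children of an element are modelled as a flat list of strings and tags.

```python
from collections import namedtuple

Tag = namedtuple("Tag", "name")  # hypothetical stand-in for a parsed element
BREAK = "__BREAK__"              # hypothetical break indicator string


def identify(children):
    """Replace <br> and <hr> stand-ins with the break indicator string."""
    return [
        BREAK if isinstance(c, Tag) and c.name in ("br", "hr") else c
        for c in children
    ]


def consolidate(children):
    """Merge consecutive strings so that indicators sit inside their text."""
    merged = []
    for child in children:
        if isinstance(child, str) and merged and isinstance(merged[-1], str):
            merged[-1] += child
        else:
            merged.append(child)
    return merged


def split(children):
    """Split each string on the indicator, emitting one ("p", text) per piece."""
    out = []
    for child in children:
        if isinstance(child, str):
            parts = (part.strip() for part in child.split(BREAK))
            out.extend(("p", part) for part in parts if part)
        else:
            out.append(child)
    return out


children = ["first line", Tag("br"), "second ", "line", Tag("hr"), "third"]

with_consolidate = split(consolidate(identify(children)))
without_consolidate = split(identify(children))
```

With the consolidate step, "second " and "line" end up in the same paragraph. Without it, they are split into two separate paragraphs, which is the kind of ordering violation a test could catch.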

jemrobinson commented 5 years ago

Reviewed in person with @martintoreilly. The only outstanding question is which of these is the correct behaviour:

1. <div>text</div> => <div>text</div>
2. <div>text</div> => <div><p>text</p></div>

I think it is (1) and this is what the code currently does, but should be confirmed.

Confirmed with @martintoreilly that (1) is correct. See the comment in plain_html.py:

We do this to ensure that there is a strong, unique correspondence between presentational paragraphs and DOM structure
     - all presentational paragraphs should be the only content associated with their immediate parent
     - all presentational paragraphs at the same conceptual level should be equally nested
     - the string as displayed in the browser should be equivalent to the innerHTML of the parent (so that indexing is equivalent between presentation and source)

    The following examples should not be allowed:

     1. Two presentational elements at the same DOM level have non-equivalent index levels
       <div index="1.1">
         text
         <p index="1.1.1">more text</p>
       </div>

     2. Index 1.1 might contain both strings
       <div index="1.1">
         <p index="1.1.1">more text</p>
         text
       </div>

     3. Two presentational paragraphs are included in the same index
       <div index="1.1">
         text
         <p index="1.1.1">more text</p>
         yet more text
       </div>
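
As a rough illustration of the first rule (a paragraph's text should be the only content of its immediate parent), the sketch below flags elements that mix bare text with child elements. It is an assumption-laden example, not part of ReadabiliPy: `has_mixed_content` and `violations` are hypothetical helpers, and the standard library's `xml.etree.ElementTree` is used instead of the Soup parse tree.

```python
import xml.etree.ElementTree as ET


def has_mixed_content(element):
    """True if the element has child elements and also non-whitespace bare text."""
    children = list(element)
    if not children:
        return False
    # Bare text can appear before the first child or after any child (its tail).
    bare = [element.text] + [child.tail for child in children]
    return any(text and text.strip() for text in bare)


def violations(markup):
    """Return the index attribute of every element with mixed content."""
    root = ET.fromstring(markup)
    return [el.get("index") for el in root.iter() if has_mixed_content(el)]


disallowed = [
    '<div index="1.1">text<p index="1.1.1">more text</p></div>',
    '<div index="1.1"><p index="1.1.1">more text</p>text</div>',
    '<div index="1.1">text<p index="1.1.1">more text</p>yet more text</div>',
]
allowed = [
    '<div index="1.1">text</div>',
    '<div index="1.1"><p index="1.1.1">text</p><p index="1.1.2">more</p></div>',
]

disallowed_results = [violations(m) for m in disallowed]
allowed_results = [violations(m) for m in allowed]
```

Each of the three disallowed examples is flagged at index 1.1, while a bare `<div>text</div>` passes unchanged, consistent with option (1) above.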

alan-turing-institute / ReadabiliPy

New method of whitespace joining #47