alan-turing-institute / ReadabiliPy

A simple HTML content extractor in Python. Can be run as a wrapper for Mozilla's Readability.js package or in pure-python mode.
MIT License

New method of whitespace joining #47

Closed by jemrobinson 5 years ago

jemrobinson commented 5 years ago

Rewrite of the plain_html logic. The sequence is now as follows:

The main differences are:

  1. Move the string consolidation before whitespace normalisation (which closes #45 and closes #46).
  2. Combine the search and replacement of <br> and <hr> elements into one function (removing fragile function ordering problems).

jemrobinson commented 5 years ago

I've added some tests which fail if the ordering constraints are violated. The paragraph breaking code calls consolidate in between the identify and split stages, so that the newly-added break indicators can be included in the correct string.
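The ordering constraints described above can be illustrated with a toy model. This is only a sketch, not the ReadabiliPy implementation: text nodes are modelled as plain strings in a list, tags as one-element tuples, and the `BREAK` marker and every function name here are hypothetical. It shows one way that normalising whitespace before consolidating could go wrong, and why consolidating between the identify and split stages puts the break indicator inside the string it belongs to. It does not reproduce the specific cases from #45 or #46.

```python
import re

BREAK = "|BREAK|"  # hypothetical break indicator, not the marker used by ReadabiliPy


def consolidate(nodes):
    """Merge runs of adjacent strings into single strings."""
    merged = []
    for node in nodes:
        if isinstance(node, str) and merged and isinstance(merged[-1], str):
            merged[-1] += node
        else:
            merged.append(node)
    return merged


def normalise_whitespace(nodes):
    """Collapse whitespace runs and trim the ends of every string node."""
    return [
        re.sub(r"\s+", " ", node).strip() if isinstance(node, str) else node
        for node in nodes
    ]


def identify_breaks(nodes):
    """Replace <br> and <hr> stand-ins with the break indicator string."""
    return [BREAK if node in (("br",), ("hr",)) else node for node in nodes]


def split_on_breaks(nodes):
    """Split every string node on the break indicator, dropping empty pieces."""
    paragraphs = []
    for node in nodes:
        if isinstance(node, str):
            pieces = (piece.strip() for piece in node.split(BREAK))
            paragraphs.extend(piece for piece in pieces if piece)
    return paragraphs


# Normalising first trims the space that separated two adjacent strings.
adjacent = ["Hello ", "world"]
normalise_then_consolidate = consolidate(normalise_whitespace(adjacent))
consolidate_then_normalise = normalise_whitespace(consolidate(adjacent))

# identify -> consolidate -> split: the indicator ends up inside one string.
page = ["first line", ("br",), "second ", "line"]
merged = consolidate(identify_breaks(page))
paragraphs = split_on_breaks(merged)
```

In this model, normalising before consolidating yields `Helloworld`, whereas consolidating first keeps `Hello world`. After the identify and consolidate stages the list holds a single string containing the indicator, which the split stage turns into two paragraphs.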

jemrobinson commented 5 years ago

Reviewed in person with @martintoreilly. The only outstanding question is which of these is the correct behaviour:

  1. <div>text</div> => <div>text</div>
  2. <div>text</div> => <div><p>text</p></div>

I think it is (1), which is what the code currently does, but this should be confirmed.

Confirmed with @martintoreilly that (1) is correct. See the comment in plain_html.py:

We do this to ensure that there is a strong, unique correspondence between presentational paragraphs and DOM structure
     - all presentational paragraphs should be the only content associated with their immediate parent
     - all presentational paragraphs at the same conceptual level should be equally nested
     - the string as displayed in the browser should be equivalent to the innerHTML of the parent (so that indexing is equivalent between presentation and source)

    The following examples should not be allowed:

     1. Two presentational elements at the same DOM level have non-equivalent index levels
       <div index="1.1">
         text
         <p index="1.1.1">more text</p>
       </div>

     2. Index 1.1 might contain both strings
       <div index="1.1">
         <p index="1.1.1">more text</p>
         text
       </div>

     3. Two presentational paragraphs are included in the same index
       <div index="1.1">
         text
         <p index="1.1.1">more text</p>
         yet more text
       </div>
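
The three disallowed examples above share one property: the parent element directly holds text and also has child elements. A minimal sketch of a check for that property follows. It is an illustration rather than code from plain_html.py, and it makes some assumptions: the function names are hypothetical, every child element is treated as a presentational block (inline tags such as `<b>` or `<a>` are not considered), and the input is a well-formed fragment that the standard library XML parser can read.

```python
import xml.etree.ElementTree as ET


def has_mixed_content(element):
    """True if the element directly holds non-whitespace text and also has children."""
    children = list(element)
    if not children:
        return False
    # Direct text is the element's leading text plus the tail after each child.
    direct_text = [element.text] + [child.tail for child in children]
    return any(text and text.strip() for text in direct_text)


def find_violations(fragment):
    """Return the index attribute (or tag name) of every mixed-content element."""
    root = ET.fromstring(fragment)
    return [el.get("index", el.tag) for el in root.iter() if has_mixed_content(el)]


example_1 = '<div index="1.1">text<p index="1.1.1">more text</p></div>'
example_2 = '<div index="1.1"><p index="1.1.1">more text</p>text</div>'
example_3 = (
    '<div index="1.1">text<p index="1.1.1">more text</p>yet more text</div>'
)
text_only = "<div>text</div>"
paragraphs_only = (
    '<div index="1.1"><p index="1.1.1">a</p><p index="1.1.2">b</p></div>'
)
```

Under these assumptions, `find_violations` reports index 1.1 for each of the three disallowed examples and nothing for `<div>text</div>`, which is consistent with behaviour (1) leaving a text-only div unchanged.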