alan-turing-institute / misinformation-crawler

Web crawler to collect snapshots of articles to web archive
MIT License

Plain content approach #27

Open martintoreilly opened 5 years ago

martintoreilly commented 5 years ago

Problem

We do not extract all of the article text from this test article from davidwolfe.com; some of the text is missing from our output.

Our list of "leaf nodes" from which we extract text content is insufficient. These are currently <p>s and <li>s (and <ul>s and <ol>s that are turned into <p>s earlier).

martintoreilly commented 5 years ago

Problems generating plain content

In the current approach we identify "leaf" nodes and convert their contents to plain text. We currently consider the following HTML elements to denote "leaves": <p>s and <li>s (plus <ul>s and <ol>s, which are converted to <p>s earlier).

Note that Readability.js replaces any div elements that just hold "inline" content with a corresponding p element, so we capture these in our p processing.

In the breaking test document we have h4 header elements, which we fail to convert to plain text. Looking further at the HTML standard, there are several more elements like this that we are currently not catching.

martintoreilly commented 5 years ago

Problems generating plain text

In the current approach we identify "paragraph" nodes in the plain_content output and extract their contents as plain text. We currently consider the following HTML elements to denote "paragraphs": p, ul and ol.

For ul and ol list elements we do some tweaking of the content of child li elements to represent them as a string of the format * item 1, * item 2, * item 3,.
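For illustration, a minimal sketch of this tweaking with Beautiful Soup (the helper name flatten_list is hypothetical, not the name used in the codebase):

```python
from bs4 import BeautifulSoup

def flatten_list(list_element):
    """Render a ul/ol element as a single '* item 1, * item 2,' style string."""
    items = [li.get_text(" ", strip=True) for li in list_element.find_all("li")]
    return " ".join(f"* {item}," for item in items)

soup = BeautifulSoup("<ul><li>item 1</li><li>item 2</li><li>item 3</li></ul>",
                     "html.parser")
print(flatten_list(soup.ul))  # * item 1, * item 2, * item 3,
```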

In the breaking test document we have a blockquote element containing both explicit p elements and an implicit paragraph that is just a string within the blockquote element. We currently fail to extract this string. In general, we would fail to extract any text whose immediate parent was not a p, ul or ol.
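A minimal example of the failing case with Beautiful Soup; the bare string is a direct child of the blockquote rather than of any p, ul or ol, so a paragraph-only extractor never sees it:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = ("<blockquote><p>An explicit paragraph.</p>"
        "An implicit paragraph as a bare string.</blockquote>")
soup = BeautifulSoup(html, "html.parser")

# Extracting only <p>, <ul> and <ol> contents misses the bare string.
print([p.get_text(" ", strip=True) for p in soup.find_all(["p", "ul", "ol"])])
# ['An explicit paragraph.']

# The missing text is a string child of the blockquote itself.
print([str(c).strip() for c in soup.blockquote.children
       if isinstance(c, NavigableString)])
# ['An implicit paragraph as a bare string.']
```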

martintoreilly commented 5 years ago

New approach for plain content generation

The new approach will remove all HTML tags that are not in a specified whitelist of block-level elements, replacing most of the removed elements with their plain-text content, but dropping some blacklisted elements completely (e.g. img, video, audio, script).

To do this we will implement the following algorithm:

  1. Start at the top-level "readability" div element we have wrapped the article HTML in.
  2. Navigate to the next "leaf" node (terminal child element) under the current element.
  3. Process the node:
     a. If the node is a Text element, keep it.
     b. If the node is not a Text element and is a blacklisted element, remove it.
     c. If the node is not a Text element, is not a blacklisted element, and has special handling rules, apply these.
     d. If the node is not a Text element, is not a blacklisted element, and has no special handling rules, replace it with the node's innerText. This is a representation of the layout with all styling removed. In our case, we will have ensured all children are text, so we can also just grab the textContent. Both of these are defined in the HTML standard. In practice, we will probably use Beautiful Soup's get_text(" ", strip=True) function, stripping trailing and leading whitespace from child Text elements before recombining them with a single space between them into a single Text element (see the sketch after this list).
  4. Navigate to the node's parent.
  5. If the parent node is a blacklisted element, remove the node.
  6. If the parent node is a whitelisted element and the previously processed child node is both the only child node and a Text element, wrap this child Text element in a p paragraph element.
  7. If the parent has unprocessed children, goto step 2.
  8. If the parent has no unprocessed children, goto step 4.
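A rough Python sketch of this algorithm using Beautiful Soup, working bottom-up over the tree. The element sets are abbreviated, the helper name simplify is illustrative, and merging of adjacent Text children is elided for brevity:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

BLOCK_WHITELIST = {"article", "aside", "blockquote", "div", "h1", "h2", "h3",
                   "h4", "h5", "h6", "li", "ol", "p", "section", "ul"}
BLACKLIST = {"img", "video", "audio", "script"}

def simplify(element, soup):
    """Reduce a subtree to whitelisted block elements containing plain text."""
    for child in list(element.children):
        if isinstance(child, NavigableString):
            continue                                   # 3a. keep Text nodes
        if child.name in BLACKLIST:
            child.decompose()                          # 3b/5. drop blacklisted elements
            continue
        simplify(child, soup)                          # recurse into the child first
        if child.name not in BLOCK_WHITELIST:
            # 3d. replace a non-block element with its flattened text content
            child.replace_with(child.get_text(" ", strip=True))
    # 6. wrap a lone Text child of a whitelisted block element in a <p>
    children = list(element.children)
    if (element.name in BLOCK_WHITELIST and len(children) == 1
            and isinstance(children[0], NavigableString)):
        paragraph = soup.new_tag("p")
        paragraph.string = str(children[0]).strip()
        children[0].replace_with(paragraph)

html = ('<div id="readability"><blockquote>A bare quote.</blockquote>'
        '<script>track();</script></div>')
soup = BeautifulSoup(html, "html.parser")
simplify(soup.find(id="readability"), soup)
print(soup)  # <div id="readability"><blockquote><p>A bare quote.</p></blockquote></div>
```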

Element lists

Block-level whitelist

article aside blockquote caption colgroup col div dl dt dd figure figcaption footer h1 h2 h3 h4 h5 h6 header li main ol p pre section table tbody thead tfoot tr td th ul
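Expressed as a Python constant (a sketch; the name BLOCK_LEVEL_WHITELIST is illustrative):

```python
BLOCK_LEVEL_WHITELIST = frozenset({
    "article", "aside", "blockquote", "caption", "colgroup", "col", "div",
    "dl", "dt", "dd", "figure", "figcaption", "footer", "h1", "h2", "h3",
    "h4", "h5", "h6", "header", "li", "main", "ol", "p", "pre", "section",
    "table", "tbody", "thead", "tfoot", "tr", "td", "th", "ul",
})
```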

Blacklist for complete removal

These elements will be completely removed, along with all their children. Q: Should we replace these with a "content removed" placeholder? We were discussing doing this for MathML, though I've decided to treat this the same as other embedded content and just remove it for now.
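A sketch of both options with Beautiful Soup, using only the example blacklisted elements named earlier (img, video, audio, script); the placeholder text is illustrative:

```python
from bs4 import BeautifulSoup

BLACKLIST = ["img", "video", "audio", "script"]

def strip_blacklisted(soup, placeholder=False):
    """Remove blacklisted elements and their children, optionally leaving a marker."""
    for element in soup.find_all(BLACKLIST):
        if placeholder:
            element.replace_with(f"[{element.name} content removed]")
        else:
            element.decompose()

soup = BeautifulSoup("<p>Before.</p><video src='a.mp4'></video><p>After.</p>",
                     "html.parser")
strip_blacklisted(soup)
print(soup)  # <p>Before.</p><p>After.</p>
```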

Elements with special handling

Undecided

Remaining elements

These elements will be replaced with their innerText (the concatenated text representations of all their children, with sensible whitespace rules for concatenation):

a abbr address b bdi bdo cite code del dfn em i ins kbd mark q rb ruby rp rt rtc s samp small span strong u var wbr
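For example, flattening inline elements to their text with Beautiful Soup (a minimal sketch):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>A <em>very</em> <strong>bold</strong> claim.</p>",
                     "html.parser")
for inline in soup.find_all(["em", "strong"]):
    inline.replace_with(inline.get_text(" ", strip=True))
print(soup)  # <p>A very bold claim.</p>
```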

Notes on classification of elements:

martintoreilly commented 5 years ago

@evelinag ☝️Please could you take a look at my plan for plain content generation above? ☝️

sgibson91 commented 5 years ago

Adding new site tests

Plain content rules

A block-level element should contain either: 1) only text; or 2) only block-level elements.

Any other scenario fails the tests.

In scenarios where we find a paragraph, then a bare line of text, then a paragraph, then a bare line of text, the bare lines should be wrapped in paragraph elements. This can be visually inspected.
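A sketch of how the first rule might be checked automatically, assuming Beautiful Soup; the block-level set is abbreviated and the function name is hypothetical:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

BLOCK_LEVEL = ["article", "aside", "blockquote", "div", "h1", "h2", "h3", "h4",
               "h5", "h6", "li", "ol", "p", "section", "ul"]

def block_elements_are_well_formed(html):
    """Each block-level element must contain only text or only block-level elements."""
    soup = BeautifulSoup(html, "html.parser")
    for block in soup.find_all(BLOCK_LEVEL):
        # Ignore whitespace-only strings between elements
        children = [c for c in block.children
                    if not (isinstance(c, NavigableString) and not c.strip())]
        has_text = any(isinstance(c, NavigableString) for c in children)
        has_blocks = any(isinstance(c, Tag) for c in children)
        if has_text and has_blocks:
            return False  # mixed text and elements fails the test
        if any(isinstance(c, Tag) and c.name not in BLOCK_LEVEL for c in children):
            return False  # a non-block element survived, which also fails
    return True

assert block_elements_are_well_formed("<div><p>text</p><p>more</p></div>")
assert not block_elements_are_well_formed("<div><p>text</p>stray line</div>")
```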

Process

Run ExtractArticle.js to extract the title, byline, content, etc. of the article, then make minimal changes to the content to generate plain text.