New approach for plain content generation
The new approach will remove all HTML tags that are not in a specified whitelist of block-level elements, replacing most of the removed elements with their plain text content, but dropping some blacklisted elements completely (e.g. img, video, audio, script).
To do this we will implement the following algorithm:
1. Start at the top-level "readability" div element we have wrapped the article HTML in.
2. Navigate to the next "leaf" node (terminal child element) under the current element.
3. Process the node:
   a. If the node is a Text element, keep it.
   b. If the node is not a Text element and is a blacklisted element, remove it.
   c. If the node is not a Text element, is not a blacklisted element, and has special handling rules, apply these.
   d. If the node is not a Text element, is not a blacklisted element, and has no special handling rules, replace it with the node's innerText. This is a representation of the layout with all styling removed. In our case, we will have ensured all children are text, so we can also just grab the textContent. innerText is defined in the HTML standard and textContent in the DOM standard. In practice, we will probably use Beautiful Soup's get_text(" ", strip=True) function, stripping trailing and leading whitespace from child Text elements before recombining them, with a single space between them, into a single Text element.
4. Navigate to the node's parent.
5. If the parent node is a blacklisted element, remove the node.
6. If the parent node is a whitelisted element and the previously processed child node is both the only child node and a Text element, wrap this child Text element in a p paragraph element.
7. If the parent has unprocessed children, goto step 2.
8. If the parent has no unprocessed children, goto step 4.
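The steps above can be sketched roughly as follows. This is only an illustration under some assumptions, not the final implementation: it uses recursion rather than explicit leaf-to-parent navigation, it skips blacklisted elements before descending into them, and it leaves out the p-wrapping step. The names simplify, special_text, WHITELIST and BLACKLIST are made up for the example, and the two sets are abbreviated versions of the element lists in this document. The special handling shown is the q, sub and sup handling described under "Elements with special handling".

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Abbreviated, illustrative subsets of the element lists in this document.
WHITELIST = {"div", "p", "blockquote", "ul", "ol", "li", "h1", "h2", "section"}
BLACKLIST = {"img", "video", "audio", "script", "style", "form", "br", "hr"}


def special_text(tag):
    """Return replacement text for q, sub and sup, or None for other tags."""
    text = tag.get_text(" ", strip=True)
    if tag.name == "q":
        return f'"{text}"'
    if tag.name == "sub":
        return f"_{text}"
    if tag.name == "sup":
        return f"^{text}"
    return None


def simplify(node):
    """Process children before their parent, then handle the parent itself."""
    if node.name in BLACKLIST:
        node.decompose()  # drop the element together with all its children
        return
    for child in list(node.children):  # copy, because the tree is mutated
        if isinstance(child, Tag):
            simplify(child)
    if node.name in WHITELIST:
        return  # block-level element: keep the tag
    text = special_text(node)
    if text is None:
        text = node.get_text(" ", strip=True)
    node.replace_with(NavigableString(text))


html = (
    '<div id="readability"><p>He said <q>hi <b>there</b></q>.'
    '<img src="x.png"/> H<sub>2</sub>O</p><script>x()</script></div>'
)
root = BeautifulSoup(html, "html.parser").find("div")
simplify(root)
print(root)
```

Because children are handled first, every non-whitelisted element only contains text by the time it is replaced, which is the property step 3d relies on.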
Element lists
Block-level whitelist
article
aside
blockquote
caption
colgroup
col
div
dl
dt
dd
figure
figcaption
footer
h1
h2
h3
h4
h5
h6
header
li
main
ol
p
pre
section
table
tbody
thead
tfoot
tr
td
th
ul
Blacklist for complete removal
These elements will be completely removed, along with all their children. Q: Should we replace these with a " content removed" placeholder? We were discussing doing this for MathML, though I've decided to treat this the same as other embedded content and just remove this for now.
button
datalist
fieldset
form
input
label
legend
meter
optgroup
option
output
progress
select
textarea
area
img
map
picture
source
audio
track
video
embed
math
object
param
svg
details
dialog
summary
canvas
noscript
script
template
data
link
time
style
nav
br
hr
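As a rough sketch of the two options raised in the question above (plain removal versus leaving a placeholder), something like the following could work with Beautiful Soup. The function name drop_blacklisted, the abbreviated BLACKLIST and in particular the placeholder format are hypothetical; the design does not fix any of them.

```python
from bs4 import BeautifulSoup, NavigableString

# Abbreviated subset of the blacklist above, for illustration only.
BLACKLIST = ["img", "video", "source", "script", "math"]


def drop_blacklisted(root, placeholder=False):
    """Remove blacklisted elements and everything inside them.

    With placeholder=True a short text marker is left behind instead.
    The marker format is an assumption made for this example.
    """
    # Re-query after each removal so that nested matches (e.g. a source
    # inside a video) are never touched once their ancestor has gone.
    while (tag := root.find(BLACKLIST)) is not None:
        if placeholder:
            tag.replace_with(NavigableString(f"[{tag.name} content removed]"))
        else:
            tag.decompose()
    return root


html = (
    '<div><p>Text<img src="a.png"/></p>'
    '<video><source src="v.mp4"/></video><script>x()</script></div>'
)
removed = drop_blacklisted(BeautifulSoup(html, "html.parser").div)
marked = drop_blacklisted(BeautifulSoup(html, "html.parser").div, placeholder=True)
print(removed)
print(marked)
```

Re-running find after every removal is quadratic in the worst case, which is probably fine for single articles; a single recursive pass would avoid it.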
Elements with special handling
q: A quote. We will wrap the innerText in quote marks ("s).
sub and sup: Sub- and super-script. We will prefix the innerText with _ for subscript and ^ for superscript.
Undecided
iframe: This content will be rendered and visible in the original article, so I feel we should really render it. It seems the srcdoc attribute can contain HTML source, which takes precedence over the src URL, so we should probably include this somehow if present. If the iframe only includes a URL src, I guess we should fetch the document ourselves if it is not fetched automatically by the GET request.
Remaining elements
These elements will be replaced with their innerText (concatenated text representations of all their children, with sensible whitespace rules for concatenation).
a
abbr
address
b
bdi
bdo
cite
code
del
dfn
em
i
ins
kbd
mark
q
rb
ruby
rp
rt
rtc
s
samp
small
span
strong
u
var
wbr
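The whitespace rule matters for inline elements that sit inside a word or carry their own padding. The comparison below is a small sketch, using Beautiful Soup on a made-up snippet, of two ways the text could be recombined: get_text(" ", strip=True) on the containing element, and unwrap() on each inline element. Which of the two counts as "sensible" here is an open choice, not something this example settles.

```python
from bs4 import BeautifulSoup

html = "<p>Un<em>believ</em>able, <a href='#'> a link </a>here.</p>"

# Option A: get_text on the whole paragraph strips each piece of text
# and rejoins the pieces with a single space.
p = BeautifulSoup(html, "html.parser").p
joined = p.get_text(" ", strip=True)

# Option B: unwrap() removes only the tags and leaves the text untouched.
p = BeautifulSoup(html, "html.parser").p
for tag in p.find_all(True):
    tag.unwrap()
unwrapped = p.get_text()

print(repr(joined))
print(repr(unwrapped))
```

Option A splits a word that an inline element interrupts, while option B keeps the source whitespace exactly, including any doubled spaces.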
Notes on classification of elements:
abbr: Abbreviation or acronym. The title attribute can contain the expanded version of the abbreviation (e.g. <p>The <abbr title="Web Hypertext Application Technology Working Group">WHATWG</abbr> started working on HTML in 2004.</p>). We could present this as textContent (title) (i.e. <p>The WHATWG (Web Hypertext Application Technology Working Group) started working on HTML in 2004.</p>). However, this is unlikely to be displayed in the main browser rendering, so I'd suggest we just replace with innerText.
bdi and bdo: Sets the text direction of an element, overriding any value inherited from parent elements. We suggest just using the innerText here as it's pretty much the only inline element we can't get rid of with genuinely no consequence. In this case text direction will be inferred from the parent dir attribute and other cues, which I think we can live with. Also, for this project all sites are English so we should not see any issues in our immediate use case.
br: Readability.js wraps text separated by multiple brs in ps, so we can safely remove any remaining brs. In the long run we may want to handle this ourselves.
del and ins: These indicate edits to the document, but the content will usually be displayed (with some formatting to indicate an insertion or deletion). The text is almost certainly visible to a user in a browser either way, so I think we should just replace with innerText.
form: We remove all form elements and their contents.
hr: Horizontal rule. This is really a paragraph-level thematic break (e.g. scene or topic change). We should really ensure preceding and following text blocks are wrapped in a block-level element (wrapping them in a section or p ourselves if not). However, this has the same complexity as wrapping text blocks separated by multiple brs, so for now I'd suggest we don't, but we may want to consider doing this later. It may be Readability.js does this, so we should check.
ruby: Used to provide ruby annotations, often representing additional representations of some base text, such as pronunciation guidance, translations or alternate character sets. Replacing the ruby element with the text contents of its child rb, rt, rtc, rp elements is the standard fallback for user agents that cannot render ruby annotations and is what we will achieve by replacing all these elements with their innerText.
s: Marks content as no longer relevant. The text is generally still displayed so I think we should just replace with innerText.
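For reference, the alternative considered in the abbr note (appending the title in parentheses) could be sketched as below. This is the option the note suggests not taking, shown only to make the trade-off concrete; expand_abbr is a hypothetical name.

```python
from bs4 import BeautifulSoup, NavigableString


def expand_abbr(root):
    """Replace each abbr with 'text (title)', or just its text if no title."""
    for abbr in root.find_all("abbr"):
        text = abbr.get_text(" ", strip=True)
        title = abbr.get("title")
        replacement = f"{text} ({title})" if title else text
        abbr.replace_with(NavigableString(replacement))
    return root


html = (
    '<p>The <abbr title="Web Hypertext Application Technology Working Group">'
    "WHATWG</abbr> started working on HTML in 2004.</p>"
)
expanded = expand_abbr(BeautifulSoup(html, "html.parser").p)
print(expanded)
```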