clj-commons / hickory

HTML as data

Duplicated text content due to misuse of the wholeText property #34

Closed njordhov closed 9 months ago

njordhov commented 8 years ago

Both hickory.core/as-hiccup and hickory.core/as-hickory use the DOM wholeText property to extract the text value of a DOM node. However, this property does not return just the text content of that node: it concatenates the text of all Text nodes logically adjacent to it.

This may lead to unexpected results, particularly when a parsed document is modified before it is converted into hiccup or hickory. Translating a Mozilla example of using wholeText into ClojureScript:

(def doc (hickory.core/parse "<p>Thru-hiking is great! <strong>No insipid election coverage</strong> However, <a href=\"http://en.wikipedia.org/wiki/Absentee_ballot\">casting a ballot</a> is tricky.</p>"))
(def para (.item (.getElementsByTagName doc "p") 0))
(.removeChild para (.item (.-childNodes para) 1))

After the strong element is removed from the paragraph, (as-hiccup para) returns:

[:p {} "Thru-hiking is great!  However, " "Thru-hiking is great!  However, " 
  [:a {:href "http://en.wikipedia.org/wiki/Absentee_ballot"} "casting a ballot"] " is tricky."]

Notice the duplicated text. Once the strong element that separated them is removed, the two remaining text nodes are adjacent, so wholeText returns the concatenation of both for each of them.
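The duplication can be reproduced without a DOM. The following is only a rough model, not hickory's code: it assumes a parent's children can be represented as a vector in which strings stand for Text nodes and vectors stand for elements, and `whole-text` and `children-after-removal` are hypothetical names introduced for illustration.

```clojure
;; Hypothetical model of the paragraph's children after the strong element
;; has been removed: strings stand for Text nodes, vectors for elements.
(def children-after-removal
  ["Thru-hiking is great! "
   " However, "
   [:a {:href "http://en.wikipedia.org/wiki/Absentee_ballot"} "casting a ballot"]
   " is tricky."])

(defn whole-text
  "Model of wholeText for the Text node at index i: the concatenation of the
  contiguous run of Text nodes (strings) that contains it."
  [children i]
  (let [before (take-while string? (reverse (subvec children 0 i)))
        after  (take-while string? (subvec children i))]
    (apply str (concat (reverse before) after))))

;; Converting each Text node via the wholeText model repeats the merged text
;; once per adjacent node, as in the output reported above.
(def converted
  (vec (map-indexed
         (fn [i child]
           (if (string? child)
             (whole-text children-after-removal i)
             child))
         children-after-removal)))
```

In this model, indexes 0 and 1 both yield the merged string, while reading each string on its own yields the two separate pieces of text.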

A fix is to call goog.dom/getRawTextContent in place of the wholeText property accessors in hickory.core.
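A minimal sketch of the difference, assuming a browser environment where `js/document` and the Google Closure library are available. `node-text` is a hypothetical helper name used only here; it is not taken from hickory.core.

```clojure
(ns example.node-text
  (:require [goog.dom :as gdom]))

(defn node-text
  "Hypothetical helper: the text of this one DOM node, without merging the
  text of adjacent Text siblings."
  [node]
  (gdom/getRawTextContent node))

;; Build a paragraph holding two adjacent Text nodes (browser DOM assumed).
(def para (.createElement js/document "p"))
(.appendChild para (.createTextNode js/document "Thru-hiking is great! "))
(.appendChild para (.createTextNode js/document " However, "))

(def first-text (.-firstChild para))

;; wholeText merges both adjacent Text nodes into one string.
(def merged (.-wholeText first-text))

;; getRawTextContent on the Text node returns only that node's own text.
(def own (node-text first-text))
```

For a Text node, reading `.-nodeValue` directly would give the same per-node text as `node-text` in this sketch.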

njordhov commented 8 years ago

Note that JSoup's getWholeText differs from the wholeText property of the DOM. The former is documented to:

Get the (unencoded) text of this text node, including any newlines and spaces present in the original

danielcompton commented 9 months ago

Fixed by https://github.com/clj-commons/hickory/pull/33