WAT: unescape XML/HTML character entities

The Common Crawl WAT files contain a lot of XML/HTML entities which should be unescaped. For links/URLs, the share of affected values exceeds 10%. Examples (HTML snippet + WAT extract):

<img src="https://comeresaprensa-a.akamaihd.net/pmd/78527749001/201807/78527749001_5808031518001_5808027492001-th.jpg?pubId=86746484001&amp;videoId=5808028819001" alt="Míchel Salgado, exjugador del Real Madrid: &quot;Cristiano Ronaldo es insustituible&quot;">

{
  "path": "IMG@/src",
  "alt": "Míchel Salgado, exjugador del Real Madrid: &quot;Cristiano Ronaldo es insustituible&quot;",
  "url": "https://comeresaprensa-a.akamaihd.net/pmd/78527749001/201807/78527749001_5808031518001_5808027492001-th.jpg?pubId=86746484001&amp;videoId=5808028819001"
},

Note that the problem applies to all kinds of XML/HTML character entities:


<a href="https://secure.customersvc.com/wes/servlet/Show?WESPAGE&#x3D;iam/pages/home.jsp&amp;MSRSMAG&#x3D;FI">
EU Customer Service
</a>

{ "path": "A@/href", "text": "EU Customer Service", "url": "https://secure.customersvc.com/wes/servlet/Show?WESPAGE&#x3D;iam/pages/home.jsp&amp;MSRSMAG&#x3D;FI" },


- in text

<a class="pdb-meta-link" href="http://www.madsack.de/" target="_blank" rel="nofollow">
© Verlagsgesellschaft Madsack GmbH & Co. KG
</a>

{ "path": "A@/href", "rel": "nofollow", "text": "© Verlagsgesellschaft Madsack GmbH & Co. KG", "url": "http://www.madsack.de/", "target": "_blank" },


- and attribute values

{ "property": "og:description", "content": "As Goal revealed on Tuesday, the Reds are in talks with Roma over signing the Brazil international, who would transform Jurgen Klopp's defence" },



The WAT extractor should replace character entities with the corresponding characters to simplify downstream processing of the WAT files.
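
The intended transformation can be illustrated with a minimal Python sketch. It uses the standard library's `html.unescape` as a stand-in for the decoding step; the actual extractor is written in Java and its implementation is not shown here. The helper name `unescape_link` and the decision to decode every string field of a link record are assumptions made for illustration only.

```python
import html

def unescape_link(link: dict) -> dict:
    """Return a copy of a WAT link record with entities decoded in string values.

    Hypothetical helper: decodes each string field exactly once.
    """
    return {k: html.unescape(v) if isinstance(v, str) else v for k, v in link.items()}

# Link record taken from the first example in this issue.
link = {
    "path": "IMG@/src",
    "alt": "Míchel Salgado, exjugador del Real Madrid: &quot;Cristiano Ronaldo es insustituible&quot;",
    "url": "https://comeresaprensa-a.akamaihd.net/pmd/78527749001/201807/"
           "78527749001_5808031518001_5808027492001-th.jpg"
           "?pubId=86746484001&amp;videoId=5808028819001",
}

clean = unescape_link(link)
print(clean["alt"])
print(clean["url"])
```

After decoding, the URL query string contains a plain `&` and the `alt` text contains plain double quotes, so consumers of the WAT file no longer need an entity-decoding step of their own.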

The changes in e0d23b8 were used for the June 2019 crawl (CC-MAIN-2019-26). A comparison of two randomly selected WAT files, one from May and one from June, shows that the number of entities in JSON string values has dropped by a factor of more than 100:

from 1,019,102 in CC-MAIN-20190526105248-20190526131248-00063.warc.wat.gz
to 8,791 in CC-MAIN-20190619204313-20190619230313-00114.warc.wat.gz

The counts are the number of pretty-printed JSON lines matching a simple regex pattern, which should give an acceptable approximation:

% zgrep '^{' CC-MAIN-20190526*.wat.gz | jq . | grep -E '&.{2,8};' | wc -l
1019102
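
For anyone who prefers to reproduce the count without `jq`, the same approximation can be sketched in Python. This is an assumption-laden equivalent rather than the command actually used: the function names are made up for illustration, and it counts JSON string values (not object keys) that match the pattern, so the result may differ slightly from the line count above.

```python
import json
import re

ENTITY = re.compile(r"&.{2,8};")  # same pattern as in the grep call

def iter_strings(value):
    """Yield every string value nested inside a decoded JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for v in value.values():
            yield from iter_strings(v)
    elif isinstance(value, list):
        for v in value:
            yield from iter_strings(v)

def count_entity_values(lines):
    """Count JSON string values containing at least one entity-like match."""
    count = 0
    for line in lines:
        if not line.startswith("{"):
            continue  # skip WARC headers, keep only the JSON payload lines
        record = json.loads(line)
        count += sum(1 for s in iter_strings(record) if ENTITY.search(s))
    return count

# Tiny in-memory sample; for a real file, pass an open text handle, e.g.
# gzip.open(path, "rt", encoding="utf-8").
sample = [
    "WARC/1.0",
    '{"a": {"url": "x?a=1&amp;b=2", "text": "plain"}, "l": ["&quot;q&quot;", "R&D"]}',
]
print(count_entity_values(sample))
```

Like the grep pattern, this also matches strings that merely look like entities (an ampersand followed by a semicolon a few characters later), so it remains an upper-bound style approximation.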

A quick check of the remaining roughly 9,000 entities revealed the following reasons why some entities are still left unescaped:

- double-escaped entities (e.g. &amp;amp; in the page source, which leaves &amp; after a single decoding pass): probably errors on the web pages in most cases. But since one might want to write "In HTML a literal ampersand must be written as &amp;", recursively decoding entities isn't good practice. It also wouldn't conform to the standard.
- entities not supported by htmlparser.org:
  - &apos;/', &comma;/, - XML/XHTML and HTML5 entities, cf. https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
  - &#x1F60A;/😊 - Unicode code points above the BMP which cannot be represented by a single Java char
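
The remaining cases can be reproduced in a few lines of Python. This is only a sketch that uses `html.unescape` from the standard library to show the expected behaviour; it says nothing about how htmlparser.org or the Java extractor handle these inputs internally.

```python
import html

# A single decoding pass keeps intentionally escaped text intact.
once = html.unescape("written as &amp;amp;")
# A second (recursive) pass would destroy the literal "&amp;".
twice = html.unescape(once)

# html.unescape knows the HTML5 named references, including &apos; and &comma;.
named = html.unescape("&apos;&comma;")

# U+1F60A lies above the BMP: it is one code point but needs two UTF-16
# code units (a surrogate pair), so a single Java char cannot hold it.
smiley = html.unescape("&#x1F60A;")
utf16_units = len(smiley.encode("utf-16-le")) // 2

print(once)
print(twice)
print(named, smiley, utf16_units)
```

In Java, a decoder that maps each reference to one `char` would need to emit a surrogate pair (for example via `Character.toChars`) to cover such code points.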

commoncrawl / ia-web-commons

WAT: unescape XML/HTML character entities #14