commoncrawl / ia-web-commons

Web archiving utility library
Apache License 2.0

WAT: unescape XML/HTML character entities #14

Closed sebastian-nagel closed 5 years ago

sebastian-nagel commented 5 years ago

The Common Crawl WAT files contain a lot of XML/HTML character entities which should be unescaped. For links/URLs, more than 10% of the values are affected. Examples (HTML snippet + WAT extract):

<img src="https://comeresaprensa-a.akamaihd.net/pmd/78527749001/201807/78527749001_5808031518001_5808027492001-th.jpg?pubId=86746484001&amp;videoId=5808028819001" alt="Míchel Salgado, exjugador del Real Madrid: &quot;Cristiano Ronaldo es insustituible&quot;">

{
  "path": "IMG@/src",
  "alt": "Míchel Salgado, exjugador del Real Madrid: &quot;Cristiano Ronaldo es insustituible&quot;",
  "url": "https://comeresaprensa-a.akamaihd.net/pmd/78527749001/201807/78527749001_5808031518001_5808027492001-th.jpg?pubId=86746484001&amp;videoId=5808028819001"
},

{ "path": "A@/href", "text": "EU Customer Service", "url": "https://secure.customersvc.com/wes/servlet/Show?WESPAGE&#x3D;iam/pages/home.jsp&amp;MSRSMAG&#x3D;FI" },


- in text

<a class="pdb-meta-link" href="http://www.madsack.de/" target="_blank" rel="nofollow"

© Verlagsgesellschaft Madsack GmbH & Co. KG

{ "path": "A@/href", "rel": "nofollow", "text": "© Verlagsgesellschaft Madsack GmbH & Co. KG", "url": "http://www.madsack.de/", "target": "_blank" },


- and attribute values

{ "property": "og:description", "content": "As Goal revealed on Tuesday, the Reds are in talks with Roma over signing the Brazil international, who would transform Jurgen Klopp&#39;s defence" },



The WAT extractor should replace the character entities with the corresponding characters to make the WAT files easier to process.
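
To illustrate the requested behaviour, here is a minimal Python sketch that decodes entities in every string value of a WAT-style JSON fragment. It is not the extractor's actual implementation (ia-web-commons is a Java library); `unescape_values` is a hypothetical helper name, and the standard-library `html.unescape` stands in for whatever unescaping routine the extractor would use.

```python
import html


def unescape_values(node):
    """Return a copy of a parsed JSON node with entities in string values decoded.

    Hypothetical helper for illustration only; keys are left untouched.
    """
    if isinstance(node, str):
        return html.unescape(node)
    if isinstance(node, dict):
        return {key: unescape_values(value) for key, value in node.items()}
    if isinstance(node, list):
        return [unescape_values(item) for item in node]
    return node  # numbers, booleans and null pass through unchanged


# Link taken from the WAT extract quoted above.
link = {
    "path": "A@/href",
    "text": "EU Customer Service",
    "url": "https://secure.customersvc.com/wes/servlet/Show?WESPAGE&#x3D;iam/pages/home.jsp&amp;MSRSMAG&#x3D;FI",
}

print(unescape_values(link)["url"])
# https://secure.customersvc.com/wes/servlet/Show?WESPAGE=iam/pages/home.jsp&MSRSMAG=FI
```

Decoding happens exactly once per value, so a literal `&amp;amp;` in the HTML source would come out as `&amp;`, which matches what a browser would show.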

sebastian-nagel commented 5 years ago

The changes in e0d23b8 have been used for the June 2019 crawl (CC-MAIN-2019-26). A comparison of two randomly selected WAT files, one from May and one from June, shows that the number of entities in JSON string values has dropped by a factor of 100.

The counts are based on a simple regex pattern which should give an acceptable approximation:

% zgrep '^{' CC-MAIN-20190526*.wat.gz | jq . | grep -E '&.{2,8};' | wc -l
1019102
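
For readers without `jq` at hand, the same approximation can be sketched in Python. This is an assumption-laden equivalent, not the command used for the numbers above: `count_entity_strings` and `count_in_wat` are hypothetical helper names, and the sketch counts JSON string values containing a match instead of pretty-printed output lines, so the totals may differ slightly from the shell pipeline.

```python
import gzip
import json
import re

# Same approximation as the grep pattern above: "&", 2 to 8 characters, ";".
ENTITY_PATTERN = re.compile(r"&.{2,8};")


def iter_strings(node):
    """Yield every string value nested anywhere inside a parsed JSON node."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def count_entity_strings(lines):
    """Count JSON string values that contain at least one entity-like match."""
    total = 0
    for line in lines:
        if not line.startswith("{"):
            continue  # skip WARC header lines, keep only JSON payload lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # a line that merely starts with "{" but is not JSON
        total += sum(1 for s in iter_strings(record) if ENTITY_PATTERN.search(s))
    return total


def count_in_wat(path):
    """Hypothetical helper: apply the count to one gzipped WAT file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_entity_strings(handle)
```

Like the regex in the shell command, this over-counts strings such as `a & b; c` that merely look like entities, so the result is an upper-bound estimate.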

A quick check of the remaining 9,000 entities showed the following reasons why there are still unescaped entities: