Closed sebastian-nagel closed 5 years ago
The changes in e0d23b8 have been used for the June 2019 crawl (CC-MAIN-2019-26). A comparison of two randomly selected WAT files, one from May and one from June, shows that the number of entities in JSON string values has dropped by a factor of about 100:
The counts are based on a simple regex pattern which should give an acceptable approximation:
% zgrep '^{' CC-MAIN-20190526*.wat.gz | jq . | grep -E '&.{2,8};' | wc -l
1019102
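For readers without `jq` at hand, the same heuristic can be sketched in Python. This is only an approximation of the pipeline above, not the script used for the numbers reported here: it counts JSON string values rather than pretty-printed lines, and the helper names `iter_strings` and `count_entity_like` are made up for illustration.

```python
import json
import re

# Same heuristic as the grep pattern: '&', then 2 to 8 arbitrary characters, then ';'.
ENTITY_LIKE = re.compile(r"&.{2,8};")


def iter_strings(value):
    """Yield every string value nested anywhere inside a decoded JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def count_entity_like(lines):
    """Count JSON string values that contain at least one entity-like substring."""
    count = 0
    for line in lines:
        if not line.startswith("{"):
            continue  # skip WARC headers and blank lines, like zgrep '^{'
        for text in iter_strings(json.loads(line)):
            if ENTITY_LIKE.search(text):
                count += 1
    return count
```

To run it on a WAT file, the lines could be read with `gzip.open(path, "rt", encoding="utf-8")` and passed to `count_entity_like`.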
A quick check of the remaining 9,000 entities revealed the following reasons why some entities are still not unescaped:
- double-escaped entities (&amp;amp;) - probably errors on web pages in most cases. But since one might want to write "In HTML a literal ampersand must be written as &amp;", recursively decoding entities isn't the best practice. It doesn't conform to the standard in any case.
- XML/XHTML and HTML5 entities, cf. https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
- numeric references to Unicode code points above the BMP (e.g. 😊, U+1F60A), which cannot be represented by a single Java char
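The first and last cases can be illustrated with a small Python sketch. It relies on the standard library's `html.unescape`, which decodes each character reference exactly once; it is not the Java code changed in e0d23b8. The helper names `decode_once` and `utf16_units` are hypothetical.

```python
import html


def decode_once(value: str) -> str:
    """Decode character references in a single pass (no recursive decoding)."""
    return html.unescape(value)


def utf16_units(text: str) -> int:
    """Number of UTF-16 code units, i.e. how many Java chars the text would need."""
    return len(text.encode("utf-16-le")) // 2


# A double-escaped ampersand keeps one level of escaping after a single pass.
once = decode_once("&amp;amp;")

# A code point above the BMP decodes to one character but needs two UTF-16 units.
smiley = decode_once("&#x1F60A;")
units = utf16_units(smiley)
```

In Java, the analogous step for the last case would presumably go through `Character.toChars(codePoint)`, which returns a surrogate pair for supplementary code points, rather than a cast to a single `char`.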
The Common Crawl WAT files contain a lot of XML/HTML entities which should be unescaped. For links/URLs, the share of affected values exceeds 10%. Examples (HTML snippet + WAT extract):
{ "path": "A@/href", "text": "EU Customer Service", "url": "https://secure.customersvc.com/wes/servlet/Show?WESPAGE=iam/pages/home.jsp&MSRSMAG=FI" },
<a class="pdb-meta-link" href="http://www.madsack.de/" target="_blank" rel="nofollow"
{ "path": "A@/href", "rel": "nofollow", "text": "© Verlagsgesellschaft Madsack GmbH & Co. KG", "url": "http://www.madsack.de/", "target": "_blank" },
{ "property": "og:description", "content": "As Goal revealed on Tuesday, the Reds are in talks with Roma over signing the Brazil international, who would transform Jurgen Klopp's defence" },
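As a rough sketch of the requested behaviour, the snippet below decodes entities once in every string value of a link record. The function name `unescape_link` is hypothetical, and the `&amp;` in the sample URL is an assumption about what the raw attribute value looked like before unescaping.

```python
import html


def unescape_link(link: dict) -> dict:
    """Return a copy of a link record with entities in string values decoded once."""
    return {
        key: html.unescape(value) if isinstance(value, str) else value
        for key, value in link.items()
    }


# Hypothetical raw record; the "&amp;" in the URL is assumed, not taken from a WAT file.
raw = {
    "path": "A@/href",
    "text": "EU Customer Service",
    "url": "https://secure.customersvc.com/wes/servlet/Show?WESPAGE=iam/pages/home.jsp&amp;MSRSMAG=FI",
}
clean = unescape_link(raw)
```

Values without entities, such as `path` and `text` here, pass through unchanged.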