WAT: only unescape complete XML/HTML character entities

commoncrawl / ia-web-commons

Web archiving utility library

Apache License 2.0

WAT: only unescape complete XML/HTML character entities #19

Closed sebastian-nagel closed 4 years ago

sebastian-nagel commented 4 years ago

Decoding of XML/HTML character entities (see #14) should be done in a safe way and only if the entities are complete. For example, the sequence &or in the URL query parameter &order=lexical must not be treated as the entity &or;.

The Translate.decode(...) method of htmlparser.org is not safe in this respect because it does not require a closing `;'.
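To illustrate what "only if complete" could look like, here is a minimal sketch that decodes a reference only when it is terminated by `;'. It is not the code used in ia-web-commons: the class name StrictEntityDecoder and its tiny entity table are hypothetical, and a real decoder would need the full HTML entity list.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class StrictEntityDecoder {
    // Tiny illustrative table; a real decoder would carry the full entity list.
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"",
            "nbsp", "\u00A0", "or", "\u2228");

    // Matches only complete references: &name; or &#123; or &#x1F;
    private static final Pattern COMPLETE =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    static String decode(String input) {
        Matcher m = COMPLETE.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String decoded = resolve(m.group(1));
            // Unknown or invalid references are copied through unchanged.
            String replacement = decoded != null ? decoded : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded text, or null if the reference is not recognized.
    private static String resolve(String body) {
        if (body.charAt(0) != '#') {
            return NAMED.get(body);
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        try {
            int cp = hex ? Integer.parseInt(body.substring(2), 16)
                         : Integer.parseInt(body.substring(1));
            return Character.isValidCodePoint(cp)
                    ? new String(Character.toChars(cp)) : null;
        } catch (NumberFormatException e) {
            return null; // number too large to be a code point
        }
    }
}
```

With this sketch, StrictEntityDecoder.decode("?q=a&order=lexical") returns its input unchanged, while "a &or; b" has its complete entity decoded.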

sebastian-nagel commented 4 years ago

Note that only malformed URLs are mangled: the ampersand should also be escaped in HTML element attributes. Of course, unescaped ampersands are a frequent error in HTML, and invalid URLs/links can be troublesome when they are used to feed a crawler, construct a webgraph, etc.

When extracting text, a lazy replacement (without a closing ;) is sometimes a good choice, e.g. for sequences such as &nbsp&nbsp&nbsp.
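A lazy replacement for text extraction might look like the following sketch. The class name LazyTextEntityDecoder and the short list of accepted names are assumptions made for illustration. They are not taken from ia-web-commons or from any library.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class LazyTextEntityDecoder {
    // Small, deliberately restricted set of names accepted without ';'.
    private static final Map<String, String> LEGACY = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"", "nbsp", "\u00A0");

    // The trailing ';' is optional here, unlike in a strict decoder.
    private static final Pattern LAZY =
            Pattern.compile("&(amp|lt|gt|quot|nbsp);?");

    static String decode(String text) {
        Matcher m = LAZY.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            m.appendReplacement(out,
                    Matcher.quoteReplacement(LEGACY.get(m.group(1))));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

This turns a run of three unterminated non-breaking-space entities into three U+00A0 characters, which is usually what a reader of the extracted text expects. The same leniency is what makes it unsuitable for URLs: a query string such as ?a=1&lt=2 would be rewritten to ?a=1<=2.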

sebastian-nagel commented 4 years ago

Finally fixed by using jsoup's Parser.unescapeEntities(...) instead, which

- supports HTML 5 entities and all other entities
- provides a "safe" mode for decoding entities in attributes