Closed by sebastian-nagel 4 years ago
Note that only malformed URLs are mangled: the ampersand should also be escaped in HTML element attributes. Of course, this is a frequent error in HTML, and invalid URLs/links can be cumbersome when they are used to feed a crawler, construct a webgraph, etc.
When extracting text, a lazy replacement (without a closing `;`) is sometimes a good choice, e.g. for sequences such as `&nbsp&nbsp&nbsp`.
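A minimal sketch of such a lazy replacement for extracted text, assuming a small hand-picked whitelist of entity names. The class `LenientTextDecoder`, its method `decodeText` and the whitelist are hypothetical and not taken from any library. It is meant for plain text only and should not be applied to URLs or attribute values.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: lenient entity decoding for extracted text, not for URLs. */
final class LenientTextDecoder {

    // Hypothetical whitelist of entities that are also accepted without ';'.
    private static final Map<String, String> LEGACY = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"", "nbsp", "\u00A0");

    // The trailing ';' is optional here, which is what makes the replacement "lazy".
    private static final Pattern LAZY = Pattern.compile("&(amp|lt|gt|quot|nbsp);?");

    static String decodeText(String text) {
        Matcher m = LAZY.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            // quoteReplacement guards against '$' and '\' in the replacement string.
            m.appendReplacement(out, Matcher.quoteReplacement(LEGACY.get(m.group(1))));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Three unterminated &nbsp sequences become three no-break spaces.
        System.out.println(decodeText("a&nbsp&nbsp&nbspb").length());
    }
}
```

Because only whitelisted names are matched, a sequence like `&order=lexical` is left alone here, but a name such as `&ampere` would still be rewritten, so this stays a text-only heuristic.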
Finally fixed by using jsoup's Parser.unescapeEntities(...) instead.
Decoding of XML/HTML character entities (see #14) should be done in a safe way and only if they are complete. A sequence `&or` (in `&order=lexical`) must not be treated as the entity `&or;` (∨). The Translate.decode(...) method of htmlparser.org is not safe in this respect, as it does not require a closing `;`.
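To illustrate the requirement, here is a rough sketch of a decoder that only touches complete references, i.e. those terminated by `;`. The class `StrictEntityDecoder`, the tiny entity table and the example.com URL are assumptions made for illustration. This is not how htmlparser or jsoup implement decoding, and a real decoder would need the full list of HTML entities.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: decode character references only when they end with ';'. */
final class StrictEntityDecoder {

    // Hypothetical, deliberately tiny entity table.
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"",
            "nbsp", "\u00A0", "or", "\u2228");

    // Matches &name; &#123; and &#x1F; and always requires the closing semicolon.
    private static final Pattern COMPLETE = Pattern.compile(
            "&(?:([A-Za-z][A-Za-z0-9]*)|#([0-9]+)|#[xX]([0-9A-Fa-f]+));");

    static String decode(String input) {
        Matcher m = COMPLETE.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = resolve(m);
            // Unknown names and invalid numbers are copied through unchanged.
            m.appendReplacement(out, Matcher.quoteReplacement(
                    replacement != null ? replacement : m.group()));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String resolve(Matcher m) {
        if (m.group(1) != null) {
            return NAMED.get(m.group(1));
        }
        try {
            int codePoint = m.group(2) != null
                    ? Integer.parseInt(m.group(2))
                    : Integer.parseInt(m.group(3), 16);
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint)) : null;
        } catch (NumberFormatException e) {
            return null; // too many digits to be a code point
        }
    }

    public static void main(String[] args) {
        // The query string contains "&or" but no ';', so it is left as-is.
        System.out.println(decode("http://example.com/?q=a&order=lexical"));
        System.out.println(decode("a &or; b &amp; c"));
    }
}
```

The input is scanned in a single pass, so `&amp;lt;` becomes `&lt;` and is not decoded a second time.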