OWASP / java-html-sanitizer

Takes third-party HTML and produces HTML that is safe to embed in your web application. Fast and easy to configure.
Other
834 stars 209 forks source link

tel URIs: Incorrect escaping due to missing RFC 3966 "tel:" URI syntax support/parsing #263

Open jmiserez opened 2 years ago

jmiserez commented 2 years ago

RFC 3966 "The tel URI for Telephone Numbers" defines the URI scheme "tel". The OWASP sanitizer currently does not handle these correctly: both escaping certain characters when it "MUST NOT", or not escaping characters when it "MUST". Testcases at the end.

This is not due to a bug in the OWASP sanitizer, but due to missing support for "tel" URI syntax described in Section 3 of RFC 3966. While the "tel" URI syntax is based on the generic syntax outlined in RFC 2396 "Uniform Resource Identifiers (URI): Generic Syntax", there are important differences.

The most important difference is described in these two sentences of Section 3:

If the reserved characters "+", ";", "=", and "?" are used as delimiters between components of
the "tel" URI, they MUST NOT be percent encoded. 
These characters MUST be percent encoded if they appear in tel URI parameter values.

Unfortunately, determining whether one of these characters is used as a delimiter or as a parameter value is not straightforward, and probably requires parsing the full syntax (also described in Section 3 in ABNF form).

Basic example: href="tel:+1234567890" -> href="tel:+1234567890": The leading plus is a separator and "MUST NOT" be percent encoded according to RFC 3966 But it depends on where the character is in the URI, e.g.: href="tel:tel:890;phone-context=;+123-4-567" -> href="tel:tel:890;phone-context=+123-4-567": The equals is a separator here and "MUST NOT" be escaped but the plus in the parameter is not a separator and thus "MUST" be percent encoded according to RFC 3966

This does matter in the real world: While browsers (tested Chrome & Firefox) will e.g. still recognize some misencoded telephone numbers, E-Mail clients (Outlook, Apple Mail, Gmail) or telephony software (MS Teams) are more restrictive and no longer recognize the numbers. In any case I would argue that a HTML sanitizer library should probably aim for correctness here, even if it still works in some browsers.

Test cases demonstrating the issue, with comments:

Simply add the testcases to your SanitizersTest and run.

I already alluded to this before in a comment last year (https://github.com/OWASP/java-html-sanitizer/issues/21#issuecomment-926057012), but did not follow up then with proper testcases. I hope this report is slightly more helpful in understanding the issue.