Sanitizer html-encodes characters in attribute url

GoogleCodeExporter commented 9 years ago

Sanitizing:

<a href="ftp://site.com:user@host/file.txt">click here</a>

the '@' is replaced by '&#64;'. However, the href and src attribute values are 
URLs, not HTML text, so I believe the '@' should be left unencoded, or if 
anything be URL-encoded.

Another context where I run into this is sanitizing email html content. It 
sometimes points to attached images using a cid: (rfc2392) URL, eg:

<img src="image6@example.domain">

What version of the product are you using? On what operating system?
r173, linux, java 6.

Thanks!

Fred

Original issue reported on code.google.com by fred.lin...@gmail.com on 8 Jun 2013 at 8:26

GoogleCodeExporter commented 9 years ago

How is this causing problems?

In HTML,

    <a href="ftp://site.com:user@host/file.txt">click here</a>
    <img src="image6@example.domain">

should be semantically equivalent to

    <a href="ftp://site.com:user&#64;host/file.txt">click here</a>
    <img src="image6&#64;example.domain">

since character references in HTML attributes are decoded before the attribute 
value is computed.
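
To illustrate the decoding step, here is a minimal sketch in plain Java. It is not the sanitizer's or any browser's actual code: `CharRefDemo` and `decodeNumericRefs` are hypothetical names, and only numeric character references are handled (named entities such as `&amp;` are left alone). A real HTML parser does considerably more.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharRefDemo {
    // Matches decimal (&#64;) and hexadecimal (&#x40;) numeric character references.
    private static final Pattern NUMERIC_REF =
        Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    /** Decodes numeric character references only; everything else is copied through. */
    static String decodeNumericRefs(String attributeValue) {
        Matcher m = NUMERIC_REF.matcher(attributeValue);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = m.group(1) != null
                ? Integer.parseInt(m.group(1), 16)
                : Integer.parseInt(m.group(2), 10);
            // Leave the reference untouched if it does not name a valid code point.
            String replacement = Character.isValidCodePoint(codePoint)
                ? new String(Character.toChars(codePoint))
                : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String plain = "ftp://site.com:user@host/file.txt";
        String encoded = "ftp://site.com:user&#64;host/file.txt";
        // Both spellings yield the same attribute value once decoded.
        System.out.println(decodeNumericRefs(encoded).equals(plain));
    }
}
```

Under these assumptions, the encoded and unencoded forms compare equal after decoding, which is why a consumer that parses the HTML sees the same URL either way.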

Original comment by mikesamuel@gmail.com on 19 Jun 2013 at 10:16

GoogleCodeExporter commented 9 years ago

Hi Mike,

I am presenting emails in a browser. Some of the image references in
the HTML part are via rfc2392 cid:{url-addr-spec} to attached images
that have Content-ID: <{url-addr-spec}>. Since the browser won't
resolve those, I replace them in the content, prior to sending to the
browser, with URLs to the corresponding image in our attachment store.
(Email stored as received. On display request sanitized, then
processed for cid-reference replacement -  work-around: do
cid-replacement, then sanitize).

A fully correct implementation would parse the document as HTML,
canonicalize the img src attribute value (first as CDATA, then as URL,
then as rfc822 addr-spec), then replace it based on lookup of
canonicalized (as URL then as rfc822 addr-spec) content-ids.

My implementation uses a regexp to do the substitution. That works
with the assumption that the img src attribute url-addr-spec and
content-id are canonicalized, which in practice is virtually always
the case.

I understand that what I'm doing is not correct, so I'm a bit
embarrassed and can't make a compelling argument. The replacement of @
with the HTML entity reference breaks the simplistic approach. If this
replacement by the sanitizer is not necessary for security, then I'd
rather have them unaltered or move towards canonical/simplified form.
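
For illustration only, a regexp-based substitution can be made tolerant of the encoded form by decoding the `@` reference in the captured value before the lookup. The sketch below is an assumption about how such a step might look, not the code discussed in this thread: `CidRewriter`, `rewrite`, and the `urlByContentId` map are hypothetical, only double-quoted `src` attributes are matched, only references to `@` are decoded, and the replacement URL is assumed to be already safe to place inside an attribute.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CidRewriter {
    // Captures the part after "cid:" in a double-quoted src attribute.
    private static final Pattern CID_SRC =
        Pattern.compile("(src=\")cid:([^\"]*)(\")", Pattern.CASE_INSENSITIVE);
    // Decimal or hexadecimal character reference for '@' (U+0040).
    private static final Pattern AT_REF =
        Pattern.compile("&#(?:0*64|[xX]0*40);");

    /** Replaces cid: references that have a known Content-ID; others are left as-is. */
    static String rewrite(String html, Map<String, String> urlByContentId) {
        Matcher m = CID_SRC.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Normalize "&#64;" back to "@" so the lookup key matches the Content-ID.
            String contentId = AT_REF.matcher(m.group(2)).replaceAll("@");
            String url = urlByContentId.get(contentId);
            String replacement = url == null
                ? m.group()
                : m.group(1) + url + m.group(3);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

This keeps the simplistic approach working on sanitized output, but it still shares the limitations described above: it does not parse the document as HTML and does not canonicalize the addr-spec.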

I'd also be happy to understand why the sanitizer must replace @ with
&#64; and redo my part the right way :-).

Thanks!
Fred

Original comment by fred.lin...@gmail.com on 20 Jun 2013 at 3:26

GoogleCodeExporter commented 9 years ago

Have you tried allowUrlProtocols("cid", "mid"), possibly combined with an 
AttributePolicy to map cid: URLs to something you can serve?

For reference 
http://owasp-java-html-sanitizer.googlecode.com/svn/trunk/distrib/javadoc/org/owasp/html/HtmlPolicyBuilder.html#allowUrlProtocols%28java.lang.String...%29 :
> Adds to the set of protocols that are allowed in URL attributes. For each URL 
> attribute that is allowed, we further constrain it by only allowing the value 
> through if it specifies no protocol, or if it specifies one in the 
> allowedProtocols white-list.

http://owasp-java-html-sanitizer.googlecode.com/svn/trunk/distrib/javadoc/org/owasp/html/AttributePolicy.html
> A policy that can be applied to an HTML attribute to decide whether or not to 
> allow it in the output, possibly after transforming its value.
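
As a rough sketch of the mapping step, the class below mirrors the shape of `AttributePolicy.apply(elementName, attributeName, value)` without depending on the library, so it can be read and run on its own. `CidSrcMapper`, `urlByContentId`, and the attachment paths are hypothetical. It assumes the value handed to the policy is already decoded (so `@` appears literally) and that returning `null` drops the attribute, as the javadoc quoted above suggests. The scheme of the rewritten URL may also need to be allowed by the policy.

```java
import java.util.HashMap;
import java.util.Map;

final class CidSrcMapper {
    private final Map<String, String> urlByContentId;

    CidSrcMapper(Map<String, String> urlByContentId) {
        // Defensive copy so later changes to the caller's map do not affect lookups.
        this.urlByContentId = new HashMap<String, String>(urlByContentId);
    }

    /**
     * Same parameter order as AttributePolicy.apply: returns the value to emit,
     * or null to signal that the attribute should be dropped.
     */
    String apply(String elementName, String attributeName, String value) {
        // Case-insensitive check for the "cid:" scheme prefix.
        if (value.regionMatches(true, 0, "cid:", 0, 4)) {
            // Unknown Content-IDs map to null, i.e. the attribute is dropped.
            return urlByContentId.get(value.substring(4));
        }
        return value;
    }

    // Possible wiring (an assumption, untested here): delegate to this method from
    // an AttributePolicy passed to
    //   new HtmlPolicyBuilder().allowElements("img").allowUrlProtocols("cid")
    //       .allowAttributes("src").matching(policy).onElements("img").toFactory()
}
```

Doing the mapping at this level means the lookup never sees the re-encoded output, so the `&#64;` form is not an issue for it.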

----

For my reference:
RFC 2392 references RFC 822, not RFC 2822, and no update switches it to 2822. 
Any addr-spec normalization would therefore have to emit output in the 
intersection of 822 and 2822. According to RFC 2822 Appendix B, the two differ 
around white-space in domains and in other places, which might introduce IPv6 
issues in domain literals.

Original comment by mikesamuel@gmail.com on 20 Jun 2013 at 2:14

GoogleCodeExporter commented 9 years ago

That looks like a way to go. Thanks!

FWIW - I've always seen rfc2822 as a more sane version of rfc822 that disallows 
some complex and [virtually] never used ways of making simple things like 
addresses complex (eg by putting whitespace and comments between atoms). I 
would treat it as rfc2822 in the rfc2392 context and accept that someone 
technically could use rfc822-legal syntax that I would reject.

Original comment by fred.lin...@gmail.com on 23 Jun 2013 at 6:10

GoogleCodeExporter commented 9 years ago

Original comment by mikesamuel@gmail.com on 28 Feb 2014 at 9:59

Changed state: WontFix

1049884729 / owasp-java-html-sanitizer

Sanitizer html-encodes characters in attribute url #13