Closed GoogleCodeExporter closed 9 years ago
How is this causing problems?
In HTML,
<a href="ftp://site.com:user&#64;host/file.txt">click here</a>
<img src="image6&#64;example.domain">
should be semantically equivalent to
<a href="ftp://site.com:user@host/file.txt">click here</a>
<img src="image6@example.domain">
since character references in HTML attributes are decoded before the attribute
value is computed.
Original comment by mikesamuel@gmail.com
on 19 Jun 2013 at 10:16
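
To illustrate the point above, here is a minimal sketch (not the sanitizer's code) that decodes numeric character references the way an HTML parser would before computing an attribute value. The class and method names (`CharRefDemo`, `decodeNumericRefs`) are made up for this example, and named references such as `&amp;` are deliberately left untouched to keep it short.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharRefDemo {
    // Matches decimal (&#64;) and hexadecimal (&#x40;) numeric character references.
    private static final Pattern NUMERIC_REF =
        Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    // Hypothetical helper: replaces each numeric character reference with the
    // character it denotes. Named references are not handled in this sketch.
    static String decodeNumericRefs(String attributeValue) {
        Matcher m = NUMERIC_REF.matcher(attributeValue);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(attributeValue, last, m.start());
            int codePoint;
            try {
                codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            } catch (NumberFormatException e) {
                codePoint = -1; // too large to be a code point
            }
            if (Character.isValidCodePoint(codePoint)) {
                out.appendCodePoint(codePoint);
            } else {
                out.append(m.group()); // keep references we cannot decode
            }
            last = m.end();
        }
        out.append(attributeValue, last, attributeValue.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String encoded = "ftp://site.com:user&#64;host/file.txt";
        String plain = "ftp://site.com:user@host/file.txt";
        // Both spellings yield the same attribute value once decoded.
        System.out.println(decodeNumericRefs(encoded).equals(plain)); // prints "true"
    }
}
```

A regexp-based post-processor that first decodes the attribute value like this would see the same string whether or not the sanitizer encoded the `@`.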
On Jun 20, 2013, at 1:17, "owasp-java-html-sanitizer@googlecode.com"
<owasp-java-html-sanitizer@googlecode.com> wrote:
Hi Mike,
I am presenting emails in a browser. Some of the image references in
the HTML part are via rfc2392 cid:{url-addr-spec} to attached images
that have Content-ID: <{url-addr-spec}>. Since the browser won't
resolve those, I replace them in the content, prior to sending to the
browser, with URLs to the corresponding image in our attachment store.
(Email is stored as received. On a display request it is sanitized, then
processed for cid-reference replacement. Work-around: do the
cid replacement first, then sanitize.)
A fully correct implementation would parse the document as HTML,
canonicalize the img src attribute value (first as CDATA, then as URL,
then as rfc822 addr-spec), then replace it based on lookup of
canonicalized (as URL then as rfc822 addr-spec) content-ids.
My implementation uses a regexp to do the substitution. That works
with the assumption that the img src attribute url-addr-spec and
content-id are canonicalized, which in practice is virtually always
the case.
I understand that what I'm doing is not correct, so I'm a bit
embarrassed and can't make a compelling argument. The replacement of @
with the HTML entity reference breaks the simplistic approach. If this
replacement by the sanitizer is not necessary for security, then I'd
rather have them unaltered or move towards canonical/simplified form.
I'd also be happy to understand why the sanitizer must replace @ with
&#64; and redo my part the right way :-).
Thanks!
Fred
Original comment by fred.lin...@gmail.com
on 20 Jun 2013 at 3:26
Have you tried allowUrlProtocols("cid", "mid"), possibly combined with an
AttributePolicy to map cid: URLs to something you can serve?
For reference
http://owasp-java-html-sanitizer.googlecode.com/svn/trunk/distrib/javadoc/org/owasp/html/HtmlPolicyBuilder.html#allowUrlProtocols%28java.lang.String...%29 :
> Adds to the set of protocols that are allowed in URL attributes. For each URL
attribute that is allowed, we further constrain it by only allowing the value
through if it specifies no protocol, or if it specifies one in the
allowedProtocols white-list.
http://owasp-java-html-sanitizer.googlecode.com/svn/trunk/distrib/javadoc/org/owasp/html/AttributePolicy.html
> A policy that can be applied to an HTML attribute to decide whether or not to
allow it in the output, possibly after transforming its value.
----
For my reference:
RFC 2392 references RFC 822, not RFC 2822, and there is no update that switches
it to 2822, so any addr-spec normalization would have to output to the
intersection of 822 and 2822. The two differ around white-space in domains and
in other places listed in RFC 2822 Appendix B, which might introduce IPv6
issues in domain literals.
Original comment by mikesamuel@gmail.com
on 20 Jun 2013 at 2:14
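
A rough sketch of the mapping step suggested above, using only the JDK. The names `CidMapper` and `mapCidUrl`, the attachment-store map, and the example URL are all hypothetical. The assumption is that an `AttributePolicy` registered for `img` `src` would delegate to something like this from its `apply(elementName, attributeName, value)` method, and that returning `null` there drops the attribute. The wiring into `HtmlPolicyBuilder` is not shown.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

public class CidMapper {
    // Hypothetical helper: maps a cid: URL to the URL of the stored attachment.
    // Returns the value unchanged if it is not a cid: URL, and null if the
    // content-id is unknown or malformed (the caller would drop the attribute).
    static String mapCidUrl(String srcValue, Map<String, String> attachmentUrlsByContentId) {
        final String prefix = "cid:";
        if (!srcValue.regionMatches(true, 0, prefix, 0, prefix.length())) {
            return srcValue; // other schemes are left to the sanitizer's protocol checks
        }
        String contentId;
        try {
            // RFC 2392 cid: URLs carry a percent-encoded addr-spec. URLDecoder
            // treats '+' as a space, so protect literal '+' before decoding.
            String raw = srcValue.substring(prefix.length()).replace("+", "%2B");
            contentId = URLDecoder.decode(raw, "UTF-8");
        } catch (UnsupportedEncodingException | IllegalArgumentException e) {
            return null; // malformed percent-escape
        }
        return attachmentUrlsByContentId.get(contentId);
    }

    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();
        store.put("image6@example.domain", "https://mail.example/attachments/42");

        System.out.println(mapCidUrl("cid:image6@example.domain", store));
        System.out.println(mapCidUrl("cid:image6%40example.domain", store));
        System.out.println(mapCidUrl("cid:unknown@example.domain", store));
    }
}
```

Because the policy would receive the already-decoded attribute value, the lookup does not depend on how `@` is spelled in the markup. No addr-spec normalization beyond percent-decoding is attempted here.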
That looks like a way to go. Thanks!
FWIW - I've always seen rfc2822 as a more sane version of rfc822 that disallows
some complex and [virtually] never used ways of making simple things like
addresses complex (eg by putting whitespace and comments between atoms). I
would treat it as rfc2822 in the rfc2392 context and accept that someone
technically could use rfc822-legal syntax that I would reject.
Original comment by fred.lin...@gmail.com
on 23 Jun 2013 at 6:10
[deleted comment]
[deleted comment]
Original comment by mikesamuel@gmail.com
on 28 Feb 2014 at 9:59
Original issue reported on code.google.com by
fred.lin...@gmail.com
on 8 Jun 2013 at 8:26