Closed GoogleCodeExporter closed 9 years ago
Are there any other constraints on output that your XML parser requires?
Does it recognize HTML specific entities like '?
Does it disallow code points outside the ranges in
http://www.w3.org/TR/2008/REC-xml-20081126/#charsets , e.g. control characters
other than \t, \r, \n and orphaned surrogates, whether escaped or not?
Original comment by mikesamuel@gmail.com
on 18 Sep 2012 at 9:01
Hello,
thanks for your response.
I use the SAXParser from the JDK (OpenJDK implementation) to further process the
output (to convert e-mail addresses to bitmaps and to remove duplicate element
attributes and unnecessary whitespace).
It does recognize '.
It does not seem to allow code points outside of the ranges defined in
http://www.w3.org/TR/2008/REC-xml-20081126/#charsets (I tried some other
control characters and some of the code points reserved for surrogates).
However, I did not know what "escaped" and "not escaped" mean; I tried
only this form: &#x???; Will this be a problem in some cases, or is it the
desired and correct behavior in this case of HTML processing?
Original comment by mila...@gmail.com
on 19 Sep 2012 at 6:40
> However, I did not know what does "escaped" and "not escaped mean", i tried
> only this form: &#x???; Will this be a problem in some cases, or it is the
> desired and correct behavior in this case of html processing.
By escaped I mean the sequence of chars seen by the XML parser contains
'&', '#', '8', ';'
which represents control character 8 in HTML,
but by "not escaped", I mean the sequence of chars seen by the XML parser
contains control character 8.
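To make the distinction concrete, here is a minimal Java sketch. The class name EscapedVsUnescaped and the helper resolve() are made up for illustration and are not part of the sanitizer or of any XML parser; the helper only handles well-formed numeric character references.

```java
public class EscapedVsUnescaped {
  /**
   * Resolves a numeric character reference such as "&#8;" or "&#x8;"
   * to the code point it names. Hypothetical helper, illustration only.
   */
  static int resolve(String ref) {
    // Drop the leading "&#" and the trailing ";".
    String digits = ref.substring(2, ref.length() - 1);
    if (digits.startsWith("x") || digits.startsWith("X")) {
      return Integer.parseInt(digits.substring(1), 16);
    }
    return Integer.parseInt(digits, 10);
  }

  public static void main(String[] args) {
    String escaped = "&#8;"; // four ordinary chars: '&', '#', '8', ';'
    String unescaped = "\b"; // one char: control character 8 itself

    System.out.println(escaped.length());
    System.out.println(unescaped.length());
    // Both forms name the same code point once the reference is resolved.
    System.out.println(resolve(escaped) == unescaped.codePointAt(0));
  }
}
```

Either way the parser ends up looking at code point 8; the only difference is whether it arrives as a character reference or as the raw character.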
> Will this be a problem...
It will not be a problem. I ask because if I am going to ensure that
the output of the HTML sanitizer is parsable by XML parsers, then I would
rather solve the problem in one release than give you a release and
have you file another bug because the parser now just fails a little later on
the same input.
Original comment by mikesamuel@gmail.com
on 19 Sep 2012 at 6:52
To summarize all cases (for the Java SAXParser):
escaped control character (0x7) - Java string ""
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 23; Character
reference "&#
escaped orphaned surrogate (0xD800) - Java string "�"
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 27; Character
reference "&#
unescaped control character (0x7) - Java string "\u0007"
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 19; An invalid XML
character (Unicode: 0x7) was found in the element content of the document.
unescaped orphaned surrogate (0xD800) - Java string "\uD800"
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 20; An invalid XML
character (Unicode: 0xd800) was found in the element content of the document.
Escaped or unescaped 0x9 and 0xD7FF (both within the ranges in
http://www.w3.org/TR/2008/REC-xml-20081126/#charsets) parse correctly.
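For anyone who wants to reproduce these cases, a minimal sketch follows. It assumes the JDK's built-in SAXParser; the class name XmlCharProbe, the probe() helper, and the <p> wrapper element are made up for illustration, so line and column numbers will not match the ones above.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class XmlCharProbe {
  /** Returns null if the body parses as element content, else the parser's message. */
  static String probe(String body) throws Exception {
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    String doc = "<p>" + body + "</p>";
    try {
      // DefaultHandler rethrows fatal errors, so a rejected document ends up here.
      parser.parse(new InputSource(new StringReader(doc)), new DefaultHandler());
      return null;
    } catch (SAXException e) {
      return e.getMessage();
    }
  }

  public static void main(String[] args) throws Exception {
    String[] bodies = {
      "&#x7;",    // escaped control character
      "&#xD800;", // escaped orphaned surrogate
      "\u0007",   // unescaped control character
      "\uD800",   // unescaped orphaned surrogate
      "&#x9;",    // escaped tab, inside the allowed ranges
      "\uD7FF"    // unescaped U+D7FF, inside the allowed ranges
    };
    for (String body : bodies) {
      String error = probe(body);
      System.out.println(error == null ? "OK" : "REJECTED: " + error);
    }
  }
}
```

The exact wording of the messages depends on the parser implementation, so the sketch only distinguishes accepted from rejected input.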
Original comment by mila...@gmail.com
on 20 Sep 2012 at 8:36
I believe
http://code.google.com/p/owasp-java-html-sanitizer/source/detail?r=114
addresses this issue.
It does three things.
(1) Makes sure that characters not in the XML Character set do not make it to
the policy as inputs. All invalid code-units are elided.
(2) Makes sure that similar characters that are emitted by a policy are elided
on rendering so will not appear in the HTML output.
(3) Adds the self-closing tag marker to all HTML5 void elements (
http://www.w3.org/TR/html-markup/syntax.html#void-element ), so instead of
seeing "<br>" in the output, you will see "<br />".
r114 is not yet packaged into a release. Let me know if that works for you and
I will put out a release.
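As a rough illustration of what eliding characters outside the XML character set means in points (1) and (2), here is a minimal sketch. It is not the code from r114; the class name XmlCharFilter and both method names are hypothetical, and the ranges are taken from the Char production in the XML 1.0 spec linked earlier in this thread.

```java
public class XmlCharFilter {
  /** True if the code point is in the XML 1.0 Char production. */
  static boolean isXmlChar(int cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
  }

  /** Returns s with every code point outside the XML Char production removed. */
  static String stripNonXmlChars(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); ) {
      // codePointAt joins valid surrogate pairs and returns an orphaned
      // surrogate as itself, which then fails the isXmlChar check.
      int cp = s.codePointAt(i);
      if (isXmlChar(cp)) {
        sb.appendCodePoint(cp);
      }
      i += Character.charCount(cp);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(stripNonXmlChars("a\u0007b\uD800c"));
  }
}
```

Iterating by code point rather than by char keeps valid surrogate pairs (supplementary characters) intact while still dropping orphaned surrogates.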
Original comment by mikesamuel@gmail.com
on 21 Sep 2012 at 10:25
It works. I checked:
- whether it solves the original problem with <br>, <hr>, etc.
- whether it removes the characters (escaped or not escaped) which are not
parseable by the XML parser (even when they are in tag names, attribute names
or attribute values)
- whether policy allow/disallow rules work when there are such characters in the
tag or attribute names (but I am not sure if I tried all the possible cases)
thanks
Original comment by mila...@gmail.com
on 22 Sep 2012 at 12:41
Release 117 includes the XML compatibility changes and is now available via the
Downloads tab and via maven. I'm marking this issue closed. Please reopen if
you run into related problems with the new release.
Change log:
http://owasp-java-html-sanitizer.googlecode.com/svn/trunk/CHANGE_LOG.html
Original comment by mikesamuel@gmail.com
on 22 Sep 2012 at 11:07
Original issue reported on code.google.com by
mila...@gmail.com
on 18 Sep 2012 at 6:59