empty-element tag transformed to start tag only

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. new HtmlPolicyBuilder
2. .allowElements("hr")
3. HtmlSanitizer.sanitize("<hr />", policy);

What is the expected output? What do you see instead?
expected - <hr />
instead - <hr>

What version of the product are you using? On what operating system?
r99

Please provide any additional information below.
For browsers the output <hr> is correct. However, <hr> without an end tag is 
not well-formed XML, so the output is unusable when it needs further XML 
processing.
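
The difference can be checked with the JDK's own SAX parser. The following is a minimal sketch, not part of the sanitizer: the class and method names (`VoidTagCheck`, `isWellFormedXml`) and the `<div>` wrapper are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
final class VoidTagCheck {
    /** Returns true if the JDK SAX parser accepts the markup as well-formed XML. */
    static boolean isWellFormedXml(String markup) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // DefaultHandler rethrows fatal (well-formedness) errors as SAXParseException.
            return false;
        } catch (Exception e) {
            // I/O or parser configuration problems are not expected for in-memory input.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Empty-element tag: accepted by an XML parser.
        System.out.println(isWellFormedXml("<div><hr /></div>"));
        // Start tag only: the parser expects a matching </hr> and rejects the input.
        System.out.println(isWellFormedXml("<div><hr></div>"));
    }
}
```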

Original issue reported on code.google.com by mila...@gmail.com on 18 Sep 2012 at 6:59

GoogleCodeExporter commented 9 years ago

Are there any other constraints on output that your XML parser requires?

Does it recognize HTML specific entities like '?
Does it disallow codepoints not in 
http://www.w3.org/TR/2008/REC-xml-20081126/#charsets , e.g. control characters 
besides \t \r \n and orphaned surrogates whether escaped or not?

Original comment by mikesamuel@gmail.com on 18 Sep 2012 at 9:01

GoogleCodeExporter commented 9 years ago

Hello,
thanks for your response.

I use the SAXParser from the JDK (OpenJDK implementation) to further process 
the output (to convert e-mail addresses to bitmaps and to remove duplicate 
element attributes and unnecessary whitespace).
It does recognize '.
It does not seem to allow code points outside the ranges defined in 
http://www.w3.org/TR/2008/REC-xml-20081126/#charsets (I tried some other 
control characters and some of the code points reserved for surrogates). 
However, I did not know what "escaped" and "not escaped" mean; I tried 
only this form: &#x???; Will this be a problem in some cases, or is it the 
desired and correct behavior for this kind of HTML processing?

Original comment by mila...@gmail.com on 19 Sep 2012 at 6:40

GoogleCodeExporter commented 9 years ago

> However, I did not know what "escaped" and "not escaped" mean; I tried 
only this form: &#x???; Will this be a problem in some cases, or is it the 
desired and correct behavior for this kind of HTML processing?

By "escaped" I mean that the sequence of chars seen by the XML parser contains
  '&', '#', '8', ';'
which represents control character 8 in HTML.
By "not escaped" I mean that the sequence of chars seen by the XML parser 
contains control character 8 itself.
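
To make the distinction concrete, here is a small sketch of the two forms as Java strings. The class and method names (`CharForms`, `escaped`, `unescaped`) are hypothetical and exist only for this illustration.

```java
// Hypothetical helper, for illustration only.
final class CharForms {
    /** Escaped form: a decimal numeric character reference, e.g. the four chars & # 8 ; */
    static String escaped(int codePoint) {
        return "&#" + codePoint + ";";
    }

    /** Unescaped form: the code point itself as a Java string. */
    static String unescaped(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // The parser sees four ordinary characters.
        System.out.println(escaped(8).length());
        // The parser sees the single control character U+0008.
        System.out.println(unescaped(8).length());
    }
}
```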

> Will this be a problem...

It will not be a problem.  I ask because if I am going to try to ensure that 
the output of the HTML sanitizer is parsable by XML parsers, then I would 
rather solve the problem in one release than give you a release and have you 
file another bug because the parser now just fails a little later on the same 
input.

Original comment by mikesamuel@gmail.com on 19 Sep 2012 at 6:52

GoogleCodeExporter commented 9 years ago

To summarize all cases (for the Java SAXParser):

escaped control character (0x7) - Java string ""
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 23; Character 
reference "&#

escaped orphaned surrogate (0xD800) - Java string "�"
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 27; Character 
reference "&#

unescaped control character (0x7) - Java string "\u0007"
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 19; An invalid XML 
character (Unicode: 0x7) was found in the element content of the document.

unescaped orphaned surrogate (0xD800) - Java string "\uD800"
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 20; An invalid XML 
character (Unicode: 0xd800) was found in the element content of the document.

escaped or unescaped 0x9 and 0xd7ff (from the ranges in 
http://www.w3.org/TR/2008/REC-xml-20081126/#charsets) are working correctly
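
A sketch of how these cases could be re-run against the JDK SAX parser follows. The wrapping `<p>` element, the hexadecimal reference syntax and the names `XmlCharMatrix` and `parses` are assumptions for illustration; the documents used in the original test are not shown in this thread.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical test harness, for illustration only.
final class XmlCharMatrix {
    /** Wraps the text in an assumed <p> root and reports whether the SAX parser accepts it. */
    static boolean parses(String text) {
        String doc = "<p>" + text + "</p>";
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(doc)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // well-formedness error, e.g. an invalid XML character
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> cases = new LinkedHashMap<>();
        cases.put("escaped control character 0x7", "&#x7;");
        cases.put("escaped orphaned surrogate 0xD800", "&#xD800;");
        cases.put("unescaped control character 0x7", "\u0007");
        cases.put("unescaped orphaned surrogate 0xD800", "\uD800");
        cases.put("escaped 0x9", "&#x9;");
        cases.put("unescaped 0x9", "\t");
        cases.put("escaped 0xD7FF", "&#xD7FF;");
        cases.put("unescaped 0xD7FF", "\uD7FF");
        cases.forEach((label, text) ->
            System.out.println(label + ": " + (parses(text) ? "accepted" : "rejected")));
    }
}
```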

Original comment by mila...@gmail.com on 20 Sep 2012 at 8:36

GoogleCodeExporter commented 9 years ago

I believe 
http://code.google.com/p/owasp-java-html-sanitizer/source/detail?r=114 
addresses this issue.

It does three things.
(1) Makes sure that characters not in the XML Character set do not make it to 
the policy as inputs.  All invalid code-units are elided.
(2) Makes sure that similar characters that are emitted by a policy are elided 
on rendering so will not appear in the HTML output.
(3) Adds the self-closing tag marker to all HTML5 void elements ( 
http://www.w3.org/TR/html-markup/syntax.html#void-element ), so instead of 
seeing "<br>" in the output, you will see "<br />".
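
The idea behind these three points can be sketched as follows. This is not the code from r114: the names (`XmlFriendlyOutput`, `isXmlChar`, `elideNonXmlChars`, `startTag`) are hypothetical, the void-element set is a partial list, and the filter works on already-decoded text.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, for illustration only; not the sanitizer's implementation.
final class XmlFriendlyOutput {
    // Partial, illustrative list of HTML5 void elements.
    private static final Set<String> VOID_ELEMENTS = new HashSet<>(
        Arrays.asList("area", "br", "col", "hr", "img", "input", "link", "meta"));

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops every code point outside the XML character set, including orphaned surrogates. */
    static String elideNonXmlChars(String s) {
        StringBuilder out = new StringBuilder(s.length());
        // codePoints() yields a well-formed surrogate pair as one code point
        // and an unpaired surrogate as its own value, which isXmlChar rejects.
        s.codePoints().filter(XmlFriendlyOutput::isXmlChar).forEach(out::appendCodePoint);
        return out.toString();
    }

    /** Renders an attribute-less start tag, adding the self-closing marker for void elements. */
    static String startTag(String name) {
        return VOID_ELEMENTS.contains(name) ? "<" + name + " />" : "<" + name + ">";
    }

    public static void main(String[] args) {
        System.out.println(elideNonXmlChars("a\u0007b\uD800c"));
        System.out.println(startTag("br"));
        System.out.println(startTag("p"));
    }
}
```

Applying the same filter to both the policy's input and its output corresponds to points (1) and (2); `startTag` corresponds to point (3).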

r114 is not yet packaged into a release.  Let me know if that works for you 
and I will put out a release.

Original comment by mikesamuel@gmail.com on 21 Sep 2012 at 10:25

Changed state: Started

GoogleCodeExporter commented 9 years ago

It works. I checked that:
- it solves the original problem with <br>, <hr>, etc.
- it removes the characters (escaped or not escaped) that the XML parser 
cannot parse (even when they are in tag names, attribute names or 
attribute values)
- policy allow/disallow rules still work when such characters appear in tag 
or attribute names (but I am not sure I tried all the possible cases)

thanks

Original comment by mila...@gmail.com on 22 Sep 2012 at 12:41

GoogleCodeExporter commented 9 years ago

Release 117 includes the XML compatibility changes and is now available via the 
Downloads tab and via maven.  I'm marking this issue closed.  Please reopen if 
you run into related problems with the new release.

Change log : 
http://owasp-java-html-sanitizer.googlecode.com/svn/trunk/CHANGE_LOG.html

Original comment by mikesamuel@gmail.com on 22 Sep 2012 at 11:07

Changed state: Fixed
