dqw / owaspantisamy

Automatically exported from code.google.com/p/owaspantisamy

Antisamy confuses nested elements with empty elements and deletes them #98

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
Steps to reproduce the problem:
1. Configure antisamy with this policy: "antisamy-myspace-1.4.2.xml"
2. Use antisamy version 1.4.2 or 1.4.1
3. Scan this text: "<p><i><q><h2>text</h2></q></i></p>"

What is the expected output? 
Nothing in the text is prohibited by the policy so the cleaned text should 
match the input text.  No errors should be reported.

What do you see instead?
The cleaned text looks like this: "<p><i /></p>\n<h2>text</h2>".
Here are the reported errors:
"The q tag was empty, and therefore we could not process it. The rest of the 
message is intact, and its removal should not have any side effects., 
The p tag was empty, and therefore we could not process it. The rest of the 
message is intact, and its removal should not have any side effects."
The paragraph's closing tag ends up in the wrong place, and the quote tag is removed entirely.

What version of the product are you using? On what operating system?
This behavior shows up in 1.4.2 and 1.4.1.  However, it works as expected in 
1.3.

Original issue reported on code.google.com by cam.morris@gmail.com on 17 Dec 2010 at 6:03

GoogleCodeExporter commented 8 years ago
For extra fun, try scanning this string:
"<UL><SUP><SUB><STRONG><SPAN><SMALL><PRE><P><OL><LI><I><H6><H5><H4><H3><H2><H1><HR><EM><FONT><DD><DT><DL><CODE><BR><BLOCKQUOTE><BIG><B><ADDRESS><A>filler text</A></ADDRESS></B></BIG></BLOCKQUOTE></BR></CODE></DL></DT></DD></FONT></EM></HR></H1></H2></H3></H4></H5></H6></I></LI></OL></P></PRE></SMALL></SPAN></STRONG></SUB></SUP></UL>"

The cleansed output on 1.4.1 and 1.4.2 is:
<ul><sup><sub><strong><span><pre><p><small><ol><li /></ol></small></p><h1><hr /><em><font><dt /></font></em></h1></pre></span></strong></sub></sup></ul>

The reported errors are:
[The small tag was empty, and therefore we could not process it. The rest of 
the message is intact, and its removal should not have any side effects., 
The small tag was empty, and therefore we could not process it. The rest of the 
message is intact, and its removal should not have any side effects., 
The i tag was empty, and therefore we could not process it. The rest of the 
message is intact, and its removal should not have any side effects., 
The h6 tag was empty, and therefore we could not process it. The rest of the 
message is intact, and its removal should not have any side effects., 
The h5 tag was empty, and therefore we could not process it. The rest of the 
message is intact, and its removal should not have any side effects., 
The h4 tag was empty, and therefore we could not process it. The rest of the 
message is intact, and its removal should not have any side effects., 
The h3 tag was empty, and therefore we could not process it. The rest of the 
message is intact, and its removal should not have any side effects., 
The h2 tag was empty, and therefore we could not process it. The rest of the 
message is intact, and its removal should not have any side effects., 
The dd tag was empty, and therefore we could not process it. The rest of the 
message is intact, and its removal should not have any side effects., 
The p tag was empty, and therefore we could not process it. The rest of the 
message is intact, and its removal should not have any side effects.]

Original comment by cam.morris@gmail.com on 17 Dec 2010 at 6:04

GoogleCodeExporter commented 8 years ago
A colleague pointed out that all the offending examples I came up with appear 
to be uncommon usage.  He thought that they probably do not conform to 
standard HTML.  So I browsed through the XHTML schema at 
http://www.w3.org/TR/2010/REC-xhtml-basic-20101123/#a_smodules.  Sure enough, 
there is a massive structure of elements and what they can contain.  In each 
of the offending cases, a tag contains tags it should not contain. 
For example, an H6 should not contain an H5, or vice versa. A Q tag should not 
contain an H2 tag, etc.
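
To illustrate the kind of rule involved, here is a minimal Java sketch that flags a block-level element placed directly inside an element that only allows inline content. The class name `NestingCheck`, its tag sets, and the regex tokenizer are all assumptions made for illustration. The sets cover only a small subset of the XHTML content model, and this is not how AntiSamy or NekoHTML perform their checks.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NestingCheck {
    // Hypothetical, deliberately incomplete subsets of the XHTML content model.
    private static final Set<String> BLOCK = Set.of(
            "p", "h1", "h2", "h3", "h4", "h5", "h6",
            "ul", "ol", "dl", "pre", "blockquote", "address", "div", "hr");

    // Elements assumed here to accept inline content only.
    private static final Set<String> INLINE_ONLY = Set.of(
            "p", "h1", "h2", "h3", "h4", "h5", "h6", "pre", "address", "dt",
            "i", "b", "q", "em", "strong", "span", "small", "big",
            "sub", "sup", "code", "font", "a");

    // Elements that never have children, so they are not pushed on the stack.
    private static final Set<String> VOID = Set.of("br", "hr", "img");

    // Naive tag matcher; good enough for simple, attribute-light input.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    /** Returns one "child inside parent" entry per suspicious direct nesting. */
    static List<String> violations(String html) {
        List<String> found = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            boolean closing = !m.group(1).isEmpty();
            if (closing) {
                // Pop back to the matching open tag; ignore stray closers.
                if (open.contains(name)) {
                    while (!open.pop().equals(name)) {
                        // discard unclosed descendants
                    }
                }
                continue;
            }
            String parent = open.peek();
            if (parent != null && INLINE_ONLY.contains(parent) && BLOCK.contains(name)) {
                found.add(name + " inside " + parent);
            }
            if (!VOID.contains(name) && !m.group().endsWith("/>")) {
                open.push(name);
            }
        }
        return found;
    }
}
```

With the first example from this issue, `NestingCheck.violations("<p><i><q><h2>text</h2></q></i></p>")` returns a single entry, `h2 inside q`, which matches the Q/H2 case described above.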

Original comment by cam.morris@gmail.com on 11 Jan 2011 at 10:10

GoogleCodeExporter commented 8 years ago
Your colleague is correct. NekoHTML (our HTML processor) will remove "illegal" 
HTML, such as nested anchor tags, etc. I don't necessarily care about what the 
spec thinks is illegal, so if you have a compelling and common example we can 
revisit.

I think there was also a typo in your first comment - you said it worked 
correctly in 1.3 - did you mean 1.4.3 or truly 1.3? Can you try it against HEAD 
if you meant 1.3?

Original comment by arshan.d...@gmail.com on 3 Feb 2011 at 8:05

GoogleCodeExporter commented 8 years ago
Closing.

Original comment by arshan.d...@gmail.com on 23 Feb 2011 at 1:20