Diacritics are html encoded by scan / getCleanHTML

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. Pass a string with diacritics into getCleanHTML (such as - AMÍLCAR)
2.
3.

What is the expected output? What do you see instead?
It seems like if it is valid Html, then the output of this method should equal 
the input exactly - so I would expect AMÍLCAR as the output, but actually get 
- AM&Iacute;LCAR instead.

What version of the product are you using? On what operating system?
1.4.5, Windows 7

Please provide any additional information below.
It appears that this encoding occurs in the serialization of the dom in 
AntiSamyDOMScanner.java, i.e. in the following code:

if ("true".equals(policy.getDirective(Policy.USE_XHTML))) {
    XHTMLSerializer serializer = new XHTMLSerializer(sw, format);
    serializer.serialize(dom);
} else {
    HTMLSerializer serializer = new HTMLSerializer(sw, format);
    serializer.serialize(dom);
}

/*
 * Get the String out of the StringWriter and rip out the XML
 * declaration if the Policy says we should.
 */
String finalCleanHTML = sw.getBuffer().toString();

Everything appears fine before the serialization, but after, the value assigned 
to finalCleanHTML has the diacritics replaced.

If there is some setting that can change this please let me know, but I was 
unable to find anything.  Also, it might be helpful to know that it really does 
not matter what policy file I use.  I used about all of the default ones 
possible - antisamy-esapi.xml, antisamy-anythinggoes-1.3.xml, ... with the same 
results.

Original issue reported on code.google.com by mattmwhe...@gmail.com on 13 Mar 2012 at 2:48

GoogleCodeExporter commented 8 years ago

"It seems like if it is valid Html, then the output of this method should equal 
the input exactly"

This isn't actually true.  That's why it's recommended to use the cleaned 
results instead of using the number of errors as an indicator is dirty is safe. 
 Not only will the output be changed when malicious content is entered, but 
various encoding and formatting is also performed.

If you stick that string in an HTML context, it displays properly.

Original comment by tad...@gmail.com on 2 Apr 2012 at 7:21

GoogleCodeExporter commented 8 years ago

"If" is the operative word.  I agree with you that you have stated how it 
works, but I don't understand why "It seems like if it is valid HTML, then the 
output of this method should equal the input exactly" should not be true, or be 
very close to true.  I guess I don't really care about formatting, but, in the 
least I don't think non-malicious content should be modified.  I would expect 
unaccepted HTML to be stripped out, but would not expect the content to be 
encoded.  In my opinion encoding should occur on output, not on input, because 
you have no idea the context of the output.  We store it in a database, and 
after that, the final destination in not necessarily known.  It could be output 
into HTML, into JavaScript, into XML, ...  If it has been encoded into HTML 
format, that is rather problematic for other output contexts.

Original comment by mattmwhe...@gmail.com on 2 Apr 2012 at 8:07

GoogleCodeExporter commented 8 years ago

Sorry it took me so long to respond.

As for modifying the what you sent to AS, sometimes it is necessary.  It will 
add valid close tags if you specify XHTML output, it will add whitespace 
formatting if you choose to have it format, and it will HTML encode special 
characters.  It probably does some other manipulation to the original input, 
but those are some of the more common ones.

Now you might not agree with the last one I listed.  You state that "because 
you have no idea the context of the output".  Indeed you do, because AS will 
only validate input that will eventually live in an HTML context.  If you are 
sticking this in a JS context, then your app is currently vulnerable.  For 
example, if you send in "alert('blah')" into AS, this will validate as OK, 
since it appears to just be a string.  In other words, it *is* safe for that 
string to live in an HTML context, but more malicious JS strings would *not* be 
safe in a JS context.

As far as input that should live in another context other than HTML, you should 
*always* encode that input according to the context in which it will live.  
There's no reason to clean it if you encode it.  If you want to validate it, go 
right ahead, but don't use AS (unless it will live in HTML).  Again, take a 
whitelist approach and only allows a specific subset of characters.

I hope this answers your question and helps with your usage of AS in the future.

If I have said anything incorrectly, would one of the AS devs please correct 
me/clarify?

Original comment by tad...@gmail.com on 25 Apr 2012 at 1:50

GoogleCodeExporter commented 8 years ago

tad9ab is correct. We only work within HTML contexts - the problem becomes 
unmanageable without that restriction in place.

Original comment by arshan.d...@gmail.com on 24 Jun 2012 at 5:19

Changed state: WontFix

Shiva6780 / owaspantisamy

Diacritics are html encoded by scan / getCleanHTML #130