Html entities encoding should be configurable (via directive ?)

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. Scan any string containing special characters (accents) with the latest
version of AntiSamy.

What is the expected output? What do you see instead?
The previous version left these characters unchanged; the latest version
converts them to HTML entities.

What version of the product are you using? On what operating system?
1.4.2 / Google App Engine

Original issue reported on code.google.com by f.masu...@gmail.com on 29 Dec 2010 at 9:52

GoogleCodeExporter commented 8 years ago

We're using AntiSamy 1.4 and seeing the same issue.

The following has been suggested as a workaround:

http://stackoverflow.com/questions/3246739/how-to-not-transform-special-characters-to-html-entities-with-owasp-antisamy

Original comment by stefan.l...@gmail.com on 13 Jan 2011 at 12:38

GoogleCodeExporter commented 8 years ago

I think unescaping HTML can be dangerous. It's really annoying that special
characters are encoded, because search engines like Hibernate Search fail to
index text containing HTML entities.
Is there a simple and secure workaround for that?

Original comment by jerome.c...@gmail.com on 28 Jan 2011 at 3:10
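
One way to narrow the risk raised here is to decode only the entities that stand for non-ASCII letters and leave every markup-significant entity (`&lt;`, `&amp;`, `&quot;`, and so on) encoded. The following is a minimal sketch of that idea, not part of AntiSamy and not a vetted security control. The class name `SelectiveEntityDecoder`, the method `decodeNonAscii`, and the small named-entity table are all hypothetical, and the sketch does not assume any particular entity style in AntiSamy's output (it accepts decimal, hexadecimal, and a few named references).

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of AntiSamy): decodes only those character
// references that represent non-ASCII letters.
final class SelectiveEntityDecoder {
    // Matches decimal (&#233;), hexadecimal (&#xE9;) and named (&eacute;) references.
    private static final Pattern ENTITY = Pattern.compile(
            "&(#[0-9]{1,7}|#[xX][0-9a-fA-F]{1,6}|[A-Za-z][A-Za-z0-9]{1,31});");

    // Assumption: a deliberately tiny table of named entities; extend as needed.
    private static final Map<String, Integer> NAMED = Map.of(
            "eacute", 0xE9, "egrave", 0xE8, "agrave", 0xE0,
            "ccedil", 0xE7, "uuml", 0xFC, "ouml", 0xF6, "auml", 0xE4);

    static String decodeNonAscii(String html) {
        Matcher m = ENTITY.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Integer cp = codePointOf(m.group(1));
            // Decode only letters above U+007F; anything else stays encoded,
            // so references such as &lt; or &#60; are never turned into markup.
            String replacement = (cp != null && cp > 0x7F && Character.isLetter(cp))
                    ? new String(Character.toChars(cp))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the code point a reference body stands for, or null if unknown or invalid.
    private static Integer codePointOf(String body) {
        if (body.charAt(0) != '#') {
            return NAMED.get(body);
        }
        boolean hex = body.length() > 1 && (body.charAt(1) == 'x' || body.charAt(1) == 'X');
        int cp = Integer.parseInt(body.substring(hex ? 2 : 1), hex ? 16 : 10);
        return Character.isValidCodePoint(cp) ? cp : null;
    }
}
```

Calling `SelectiveEntityDecoder.decodeNonAscii(...)` on already-sanitized output would restore accented letters for indexing or editing while angle brackets, ampersands, and quotes remain entities. Whether that is acceptable for a given application still needs its own security review.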

GoogleCodeExporter commented 8 years ago

Dealing with non-ASCII character sets is tricky, and we will always err on the 
side of caution when dealing with them. Also, AntiSamy is only a tool for the 
domain of HTML, where the accented character and the HTML entity are 
functionally equivalent. Therefore I am going to mark this "WontFix" unless 
someone has a compelling reason why HTML or browsers should see different 
behavior as opposed to your search engine API.

That being said, I think you have options. Can you HTML-encode your input 
before you hand it off to the search engine? That way they'll be speaking the 
same language.

Original comment by arshan.d...@gmail.com on 3 Feb 2011 at 8:00

Added labels: Priority-Low
Removed labels: Priority-Medium

GoogleCodeExporter commented 8 years ago

Well, I think HTML entities (for special characters like accented characters)
are not useful when we use UTF-8. The fact is that I can't handle the text I
save in my database differently from the text I index in the search engine. I
use Hibernate Search and everything is automated, so I can't apply special
treatment for that.

Is it so difficult to implement this little feature? I don't understand why it
is so ugly and difficult to implement.

Original comment by jerome.c...@gmail.com on 4 Feb 2011 at 8:56

GoogleCodeExporter commented 8 years ago

I totally agree with Jerome. I have a use case where we enter HTML into a text
area and persist the sanitized version. Whenever a user comes back to edit the
HTML, the non-ASCII characters have been translated into HTML entities (not
what the user entered). One might argue that we should persist exactly what
the user entered and sanitize when displaying the HTML, but I'd prefer not to
have scary HTML in my database at all.

Original comment by stefan.l...@gmail.com on 4 Feb 2011 at 1:58

GoogleCodeExporter commented 8 years ago

Yep, like Stefan, I'm sanitizing HTML entered in a textarea by my users before
persisting it. And like Jerome, I'm using UTF-8.

An option to enable/disable this behavior would be great.

Original comment by f.masu...@gmail.com on 4 Feb 2011 at 2:14

GoogleCodeExporter commented 8 years ago

It will add a lot of attack surface to AntiSamy since I'll have to write a new 
serializer to escape non-ASCII characters according to some specification (that 
you haven't articulated yet). Which characters should be encoded, and which 
shouldn't, and why? Have you thought through the security considerations? How 
will this affect all the major browsers?

If you think it's trivial to implement, I'll be happy to look at patches.

Original comment by arshan.d...@gmail.com on 4 Feb 2011 at 7:32

GoogleCodeExporter commented 8 years ago

What is strange is that it worked nicely when we first used it. But after
updating to the latest version, the problem arose and we had to modify our
code accordingly.

We don't want AntiSamy to be less secure for sure :-)

Original comment by f.masu...@gmail.com on 4 Feb 2011 at 8:08

GoogleCodeExporter commented 8 years ago

If a clear specification is given and a reasonable explanation on why it won't 
cause problems is provided, I will absolutely write the serializer needed to 
accomplish this. Until then, I am going to mark this as "WontFix".

Original comment by arshan.d...@gmail.com on 16 Feb 2011 at 1:55

Changed state: WontFix
Added labels: Type-Enhancement
Removed labels: Type-Defect

GoogleCodeExporter commented 8 years ago

We have the same problem with German umlauts. Perhaps it would be an acceptable
solution to offer a whitelist of characters which shouldn't be escaped? With
such a configuration option we could add all German umlauts, which definitely
won't hurt anybody. ;-)

Original comment by ewert%ne...@gtempaccount.com on 17 Feb 2011 at 11:33
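
The whitelist idea proposed here could look roughly like the sketch below. It is only an illustration of the proposal, not AntiSamy's serializer or any existing AntiSamy option. The class name `WhitelistEntityEncoder` and its constructor argument are hypothetical, and the sketch makes three assumptions: it encodes text content only (not attribute values or whole documents), it writes numeric character references, and it passes all ASCII characters through apart from the four markup-significant ones.

```java
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of a configurable whitelist; not an AntiSamy API.
final class WhitelistEntityEncoder {
    // Non-ASCII code points that are written as-is instead of being escaped.
    private final Set<Integer> allowed;

    WhitelistEntityEncoder(String allowedChars) {
        this.allowed = allowedChars.codePoints().boxed().collect(Collectors.toSet());
    }

    // Encodes text content: markup-significant characters always become
    // entities; non-ASCII characters become numeric references unless whitelisted.
    String encode(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            switch (cp) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:
                    if (cp <= 0x7F || allowed.contains(cp)) {
                        out.appendCodePoint(cp);
                    } else {
                        out.append("&#").append(cp).append(';');
                    }
            }
        });
        return out.toString();
    }
}
```

With `new WhitelistEntityEncoder("äöüÄÖÜß")`, German umlauts would survive unchanged while any other non-ASCII character is still escaped. Note that a letter typed as a base character plus a combining mark (for example `u` followed by U+0308) would not match a precomposed whitelist entry, which is one of the specification questions a real implementation would have to answer.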

GoogleCodeExporter commented 8 years ago

I have exactly the same problem. Unfortunately AntiSamy is no longer usable for 
me. My users are Portuguese speakers and they use a lot of accents. When they 
go to edit their post, they're presented with all sorts of mumbo-jumbo
character codes in the text area.

The characters I need are:  Á, Â, Ã, À, Ç, É, Ê, Í, Ó, Ô, Õ, Ú, Ü, 
and their lower case equivalents. Also, I don't understand why quote marks "" 
need to be encoded.

Perhaps I misunderstood the use of AntiSamy, but can't I give people the chance 
to edit their input without showing them weird codes and what not?

Original comment by husseini...@gmail.com on 15 Jun 2011 at 3:04

dqw / owaspantisamy

Html entities encoding should be configurable (via directive ?) #99