Closed GoogleCodeExporter closed 8 years ago
We're using antisamy 1.4 and seeing the same issue.
The following has been suggested as a workaround:
http://stackoverflow.com/questions/3246739/how-to-not-transform-special-characters-to-html-entities-with-owasp-antisamy
Original comment by stefan.l...@gmail.com
on 13 Jan 2011 at 12:38
Unescaping HTML can be dangerous, I think. It's really annoying that special
characters are encoded, because search engines like Hibernate Search fail to
index text with HTML entities.
Is there a simple and secure workaround for that?
Original comment by jerome.c...@gmail.com
on 28 Jan 2011 at 3:10
Dealing with non-ASCII character sets is tricky, and we will always err on the
side of caution when dealing with them. Also, AntiSamy is only a tool for the
domain of HTML, where the accented character and the HTML entity are
functionally equivalent. Therefore I am going to mark this "WontFix" unless
someone has a compelling reason why HTML or browsers should see different
behavior as opposed to your search engine API.
That being said, I think you have options. Can you HTML-encode your input
before you hand it off to the search engine? That way they'll be speaking the
same language.
Original comment by arshan.d...@gmail.com
on 3 Feb 2011 at 8:00
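A minimal sketch of the suggestion above (encoding the search input so it matches the sanitized text that was indexed). This is not part of AntiSamy: the class name QueryEntityEncoder and the method encodeNonAscii are hypothetical, and the sketch assumes the indexed text stores non-ASCII characters as decimal character references. If the sanitizer emits a different form, such as named entities, the encoder would have to produce that form instead.

```java
public final class QueryEntityEncoder {
    private QueryEntityEncoder() {}

    /**
     * Replaces every non-ASCII code point with a decimal character
     * reference (for example U+00E9 becomes "&#233;"). ASCII is copied
     * through unchanged. ASSUMPTION: the indexed text uses this same form.
     */
    public static String encodeNonAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (cp < 128) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
            // Advance by one or two chars so surrogate pairs stay intact.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "café" written with a Unicode escape.
        System.out.println(encodeNonAscii("caf\u00e9"));
    }
}
```

The encoded string would then be passed to the search API in place of the raw user query.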
Well, I think HTML entities (for special characters like accented characters)
are not useful when we use UTF-8. The fact is that I can't handle the text I
save in my database differently from the text I index in the search engine. I
use Hibernate Search and everything is automated, so I can't apply special
treatment for that.
Is it so difficult to add this little feature? I don't understand why it is so
ugly and difficult to implement.
Original comment by jerome.c...@gmail.com
on 4 Feb 2011 at 8:56
I totally agree with Jerome. I have a use case where we enter HTML into a text
area and persist the sanitized version. Whenever a user comes back to edit the
HTML, the non-ASCII characters have been translated into HTML entities (not
what the user entered). One might argue that we should persist exactly what
the user entered and sanitize when displaying the HTML, but I'd prefer not to
have scary HTML in my database at all.
Original comment by stefan.l...@gmail.com
on 4 Feb 2011 at 1:58
Yep, like Stefan, I'm sanitizing HTML entered in a textarea by my users before
persisting it. And like Jerome, I'm using UTF-8.
An option to enable/disable this behavior would be great.
Original comment by f.masu...@gmail.com
on 4 Feb 2011 at 2:14
It will add a lot of attack surface to AntiSamy since I'll have to write a new
serializer to escape non-ASCII characters according to some specification (that
you haven't articulated yet). Which characters should be encoded, and which
shouldn't, and why? Have you thought through the security considerations? How
will this affect all the major browsers?
If you think it's trivial to implement, I'll be happy to look at patches.
Original comment by arshan.d...@gmail.com
on 4 Feb 2011 at 7:32
What is strange is that it was working nicely when we first used it. But
after updating to the latest version, the problem arose and we had to modify
our code accordingly.
We don't want AntiSamy to be less secure for sure :-)
Original comment by f.masu...@gmail.com
on 4 Feb 2011 at 8:08
If a clear specification is given and a reasonable explanation of why it won't
cause problems is provided, I will absolutely write the serializer needed to
accomplish this. Until then, I am going to mark this as "WontFix".
Original comment by arshan.d...@gmail.com
on 16 Feb 2011 at 1:55
We have the same problem with German umlauts. Perhaps it would be an acceptable
solution to offer a whitelist of characters which shouldn't be escaped? With
such a configuration option we could add all German umlauts, which definitely
won't hurt anybody. ;-)
Original comment by ewert%ne...@gtempaccount.com
on 17 Feb 2011 at 11:33
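The whitelist idea above can be approximated outside AntiSamy as a post-processing step on the sanitized output. The following is only a sketch under stated assumptions: UmlautEntityDecoder and decodeAllowed are hypothetical names, not an AntiSamy API, and it assumes the sanitized text contains the standard HTML named entities for the German umlauts. Only the listed entities are turned back into characters; markup-significant entities such as &lt;, &gt;, &amp; and &quot; are left untouched.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public final class UmlautEntityDecoder {
    // Explicit allow-list: entity -> literal character. Anything not listed
    // here (including &lt;, &gt;, &amp;, &quot;) stays encoded.
    private static final Map<String, String> ALLOWED = new LinkedHashMap<>();
    static {
        ALLOWED.put("&auml;", "\u00e4");
        ALLOWED.put("&ouml;", "\u00f6");
        ALLOWED.put("&uuml;", "\u00fc");
        ALLOWED.put("&Auml;", "\u00c4");
        ALLOWED.put("&Ouml;", "\u00d6");
        ALLOWED.put("&Uuml;", "\u00dc");
        ALLOWED.put("&szlig;", "\u00df");
    }

    private UmlautEntityDecoder() {}

    /** Decodes only the allow-listed entities in already sanitized text. */
    public static String decodeAllowed(String sanitized) {
        String result = sanitized;
        for (Map.Entry<String, String> entry : ALLOWED.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(decodeAllowed("M&uuml;nchen &lt;b&gt;"));
    }
}
```

Because the replacement values are plain letters, decoding them cannot introduce new tags or attributes. If the sanitizer emits numeric references instead of named ones, the map would need the corresponding entries.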
I have exactly the same problem. Unfortunately AntiSamy is no longer usable for
me. My users are Portuguese speakers and they use a lot of accents. When they
go to edit their post, they're presented with all sorts of mumbo-jumbo
character codes in the text area.
The characters I need are: Á, Â, Ã, À, Ç, É, Ê, Í, Ó, Ô, Õ, Ú, Ü,
and their lower case equivalents. Also, I don't understand why quote marks ""
need to be encoded.
Perhaps I misunderstood the use of AntiSamy, but can't I give people the chance
to edit their input without showing them weird codes and whatnot?
Original comment by husseini...@gmail.com
on 15 Jun 2011 at 3:04
Original issue reported on code.google.com by
f.masu...@gmail.com
on 29 Dec 2010 at 9:52