avoid strict html entity encoding

ren-zhijun-oracle commented 15 years ago

Hi!

Mojarra encodes any any non-ascii character using the html entity encoding or the &#xxx; html encoding with h:outputText.

I'd like to know what you think about an enhancement that one is able to configure this behavior? It should be sufficent to encode the "dangerous" caracters like <>, etc, other characters could be passed through and written in the charset the developer has choosen for the page, that is, a simple "write thorugh" (if configured) in HtmlUtils.writeText should do the trick, no?

I can try and attach a patch if wanted.

Environment

Operating System: All Platform: All

Affected Versions

[1.2_08]

ren-zhijun-oracle commented 6 years ago

Issue Imported From: https://github.com/javaserverfaces/mojarra/issues/755
Original Issue Raised By:@javaserverfaces
Original Issue Assigned To: @glassfishrobot
Original Issue Closed By:@javaserverfaces

ren-zhijun-oracle commented 15 years ago

@javaserverfaces Commented Reported by imario

ren-zhijun-oracle commented 15 years ago

@javaserverfaces Commented @rlubke said: This looks similar to issue 727 [1]. However, 727 only covers the case where a decimal ref would be written and not the html entities.

If you'd like to submit a patch that builds off 727, that would be great.

[1] https://javaserverfaces.dev.java.net/issues/show_bug.cgi?id=727

ren-zhijun-oracle commented 15 years ago

@javaserverfaces Commented @rlubke said: I'm no expert in this area, but I wonder if it's possible to get rid of the configuration options altogether and base the behavior on the current response character set?

ren-zhijun-oracle commented 15 years ago

@javaserverfaces Commented imario said:

I'm no expert in this area, but I wonder if it's possible to get rid of the configuration options altogether and base the behavior on the current response character set?

I am not sure about that. The internal representation of a Java String will always be UTF-16 (hopefully ). Which means, that a String can hold (nearly?) any character available in IT world.

AFAIK, in UTF-16 any character <=0xFF will map to the ISO-8859-1 charset. Now, when it comes to render the string using a page encoding different to UTF or ISO-8859-1 it might not be possible to map the character accordingly, which means we have to html entity encode it again.

In other words: 1) we can avoid any html entity encoding if the output is UTF 2) we can avoid any ISO-8859-1 mapped html entity encoding if the output is UTF-8 or ISO-8859-1 (have to look at ISO-8859-15; the euro-character-charset in more detail) 3) else we have to stick with current html entity encoding

I think that makes sense.

I'll try to create a patch following the above rules.

ren-zhijun-oracle commented 15 years ago

@javaserverfaces Commented imario said: Shall I prepare a patch which keeps BUG727 or can I prepare a new one which will replace BUG727 with the logic outlined above. Which means the "com.sun.faces.disableUnicodeEscaping" configuration can be removed again.

ren-zhijun-oracle commented 15 years ago

@javaserverfaces Commented @rlubke said: I believe we added the config option as there were backwards compatibility concerns, however, thinking more about this if the rules you've outlined are sufficient then it would be nice to get by without yet another config option.

Adding Mike to the CC list as he may have input.

ren-zhijun-oracle commented 15 years ago

@javaserverfaces Commented youngm said: One thing to be aware of while writing this patch is it can be very difficult to attempt to auto detect the response charset. There are any number of areas between the browser, and the responsewriter that can mess up the encoding of the response. For Example, some filter may be creating a response buffer set to encode to ISO-8859-1. If the faces response encoding is set to UTF-8 and the ResponseWriter assumes it doesn't need to encode anything we may run into trouble. The original switch was put in there to reward people who put forth the effort to ensure that everything between the browser and the responsewriter are all using the same encoding.

So if you want to include ISO8859_1_Entities entities in there that's great but keep in mind it should be overridable and I think it should default to always HTML encode as the "always work" scenario.

One possible patch might be to check the disableUnicodeEscaping flag. If it is true then you can check the response encoding to determing if you should html encode the ISO8859_1_Entities.

Just a thought.

Mike

ren-zhijun-oracle commented 15 years ago

@javaserverfaces Commented imario said: Ok, I've a patched version running here now which keeps the configuration flag in mind and does its processing only if disableUnicodeEscaping=true.

What we have now is:

0) disableUnicodeEscaping=false, so encode everything a) Allow "pass through" mode if output is set to UTF-8 b) Allow "pass through" of ISO-8859-1 chars if output is set to UTF-8 c) Html-Encode if the above said does not match

But this now means, if one has output set to ISO-8859-15 s/he gets the same output as with disableUnicodeEscaping=false, even though, if Java will be able to convert the UTF-16 char to a corresponding ISO-8859-15 byte.

The question is, do we want to support scenario d?

d) Allow "pass through" even if the output encoding might not be able to represent the character.

Which then results into having "?" rendered if Java can not convert the character.

While I think it would be nice to support d, I ask myself if it shouldn't be common to use an UTF charset for any web-page rendering. You can configure that in Java without having to touch the application. So very easy. I think the most common use-cases are 0 and a.

After a code-cleanup I should be able to attach a patch for review in the next 48 hours.

ren-zhijun-oracle commented 15 years ago

@javaserverfaces Commented imario said: Sorry, b should read:

b) Allow "pass through" of ISO-8859-1 chars if output is set to ISO-8859-1, html encode any unicode character.

ren-zhijun-oracle commented 15 years ago

@javaserverfaces Commented youngm said: Sounds good.

I think the only way that we could support an option "d" would be to add another configuration flag to tell JSF to never encode ISO-8859-1 entities which I believe is unnecessary.

I think that the case of someone using something other than ISO-8859-1 or UTF-8 is going to be rare. If that case comes up they will just have to live with html encoded ISO-8859-1 characters.

Mike

ren-zhijun-oracle commented 15 years ago

@javaserverfaces Commented imario said: Created an attachment (id=658) patch to avoid unicode/iso escaping if possible (and allowed through configuration)

ren-zhijun-oracle commented 15 years ago

@javaserverfaces Commented imario said: JCA/SCA sent to JDIC_JCA_submit@sun.com

ren-zhijun-oracle commented 15 years ago

@javaserverfaces Commented @rlubke said: r=rlubke. Mike - any comments?

ren-zhijun-oracle commented 15 years ago

@javaserverfaces Commented youngm said: I like the ISO escaping you added.

However, it looks like that even if the user specifies to "disableUnicodeEscaping" we are still doing a check to see if we are really using unicode. I'm not much of a charset expert but would there ever be a situation where an application was not using a unicode charset but wanted to disable escaping anyway?

The configuration option in 727 kind of gave the power to the user to determine that they are using a charset that doesn't need to be escaped. This patch limits that power by only letting them disable escaping if they are using a unicode charset that is in our list.

But then again I'm not a charset expert so maybe there is never a case to not escape one of the charsets we specify?

If that is not the case then I would propose this code snippet instead:

if (disableUnicodeEscaping) { escapeUnicode = false; if (HtmlUtils.isISO8859_1encoding(charsetName) || HtmlUtils.isUTFencoding(charsetName))

{ escapeIso = false; }

}

Just a thought. Otherwise: r=youngm

Mike

ren-zhijun-oracle commented 15 years ago

@javaserverfaces Commented imario said:

If the faces response encoding is set to UTF-8 and the ResponseWriter assumes it doesn't need to encode anything we may run into trouble

From this comment I was under the impression that we should try hard to provide a correct html output in any case. Means, even if the developer misconfigured the system we should output html code which always renders the correct character on the client side.

Probably I got that wrong.

The current solution allows to have a correct output with german umlauts etc even if you use US-ASCII. Which is nice I think.

But for sure, I've created the list of UTF charsets out of the out of the output of Charset.listAvailable (or so) on a windows xp machine, I also do not know if it is complete. But wouldn't it be very unlikely to use anything else then UTF-8 or UTF-16?

If we change the escape* detection as you outlined, then why bind the ISO escaping to the charset and do not simply not encode it too? We can argue that the developer wanted it that way. For sure, that might lead to having "?" rendered if the response was not configured correctly.

What about keeping my proposed code, but allowing "disableUnicodeEscaping" to be "true|false|auto"?

The documentation then will be:

"disableUnicodeEscaping" allows to control if unicode characters not renderable by the response stream get encoded or not. When set to:

"true" unicode characters will be encoded always.
"false" unicode characters will never be encoded. This might lead to "?" in the response stream if the charset of the response stream is not able to render the unicode character correctly. For example, you can't render german umlauts with a response stream set to US-ASCII.
"auto" the system will automatically switch between encoding or not depending on the response stream's charset. If it uses a UTF charset no escaping will take place, if it uses the charset ISO-8859-1 only the "upper" unicode characters will be encoded. "auto" can be used e.g. for applications which take the Accept-Charset http header-field into account and change the response stream accordingly. "auto" then will ensure that even if the client only accepts US-ASCII an unicode character will be rendered correctly (given that the browser is aware of all the html entities )

What do you think?

ren-zhijun-oracle commented 15 years ago

@javaserverfaces Commented imario said: sorry, I mixed "true" and "false" in the above doc.

true: encodeIso=false encodeUnicode=false

false: encodeIso=true encodeUnicode=true

auto: encodeIso=response stream not unicode and not ISO-8859-1 encodeUnicode=response stream not unicde

ren-zhijun-oracle commented 15 years ago

@javaserverfaces Commented youngm said: Again I don't know much about charsets and when one would be used over another. If you think it would be very rare that something other than a UTF charset would be used then I think we're good with the way the patch is.

Mike

ren-zhijun-oracle commented 15 years ago

@javaserverfaces Commented imario said: Created an attachment (id=660) updated patch with enhanced configuration option disableUnicodeEscaping=true|false|auto

ren-zhijun-oracle commented 15 years ago

@javaserverfaces Commented youngm said: r=youngm

ren-zhijun-oracle commented 15 years ago

@javaserverfaces Commented @rlubke said: Changes applied (svn rev r5033). Thanks for the patch.

ren-zhijun-oracle commented 15 years ago

@javaserverfaces Commented @rlubke said:

- - Issue 681 has been marked as a duplicate of this issue. ***

ren-zhijun-oracle commented 12 years ago

@javaserverfaces Commented @manfredriem said: Closing issue out

ren-zhijun-oracle commented 15 years ago

@javaserverfaces Commented File: escaping.patch Attached By: imario

ren-zhijun-oracle commented 15 years ago

@javaserverfaces Commented File: escaping.patch Attached By: imario

ren-zhijun-oracle commented 15 years ago

@javaserverfaces Commented Was assigned to rlubke

ren-zhijun-oracle commented 7 years ago

@javaserverfaces Commented This issue was imported from java.net JIRA JAVASERVERFACES-751

ren-zhijun-oracle commented 15 years ago

@javaserverfaces Commented Marked as fixed on Sunday, July 13th 2008, 9:44:09 am

eclipse-ee4j / mojarra

avoid strict html entity encoding #755

Environment

Affected Versions