HtmlUnit / htmlunit-neko

HtmlUnit adaptation of NekoHtml
Apache License 2.0
17 stars 15 forks source link

StandardEncodingTranslator: Fix big5, euc-kr, koi8-u, etc regressions #111

Closed atnak closed 6 months ago

atnak commented 6 months ago

This PR does the following

This PR firstly attempts to fix regressions introduced in #110 by restoring these encodings:

Encoding Cause
windows-874 Was not supported in EncodingMap.fIANA2JavaMap
koi8-u Was not supported in EncodingMap.fIANA2JavaMap
big5 Adjusted form big5-hkscs was not supported in EncodingMap.fIANA2JavaMap
euc-kr Adjusted form windows-949 was not supported in EncodingMap.fIANA2JavaMap
x-user-defined Was not supported in EncodingMap.fIANA2JavaMap (obviously)

To do this, this PR removes the troublesome dependency on EncodingMap.fIANA2JavaMap and internally defines IANA_TO_JAVA_ENCODINGS.

In doing so, this PR also adds support to these encodings and labels that were not previously supported.

Encoding Labels
macintosh csmacintosh, mac, x-mac-roman
x-mac-cyrillic x-mac-ukrainian

Finally, this PR attempts to fix parts of HTMLScanner that will be confused by StandardEncodingTranslator.encodingNameFromLabel() returning a non-null value which is not a valid Java encoding. **

** With the exception of "replacement", see following heading.

This PR does not do the following

The following appears to be an issue but I didn't attempt to fix this due to limited knowledge:

The word "replacement" can make its way[1] into HTMLScanner.fJavaEncoding even though this field seems to be used as a sanitised source[2]. I don't know how to fix this or what we want to do with "replacement" so I simply let it through[3] in this PR.

[1] https://github.com/HtmlUnit/htmlunit-neko/blob/76bb1d16575d02f443c037aa54ecc7395d91b7f2/src/main/java/org/htmlunit/cyberneko/HTMLScanner.java#L3019 [2] https://github.com/HtmlUnit/htmlunit-neko/blob/76bb1d16575d02f443c037aa54ecc7395d91b7f2/src/main/java/org/htmlunit/cyberneko/HTMLScanner.java#L528-L531 [3] https://github.com/HtmlUnit/htmlunit-neko/pull/111/files#diff-c0ef5d78bafd2e2d032b106fd0ca50086966f97030bd6b6f4403d8f5ebe39f87R3011

rbri commented 6 months ago

Will have a look, maybe you can provide sample html files for the encodings like the one already added...

And thanks for all your efforts

atnak commented 6 months ago

maybe you can provide sample html files for the encodings like the one already added...

Since StandardEncodingTranslator.encodingNameFromLabel() is essentially an input string (HTTP encoding) to output string (Java encoding) convertor, I think a reasonable test would test input vs output to make sure results are expected. I've added one such test in StandardEncodingTranslatorTest.conversions() that tests significant cases. This probably makes the sample html files I provided before redundant.

Having said that, the MS932 sample html files I provided before did show the need for the second ENCODING_TO_IANA_ENCODING conversion step. We discovered the issue coincidentally because we're Japanese and have worked with Japanese encodings for a very long time. I cannot say the same for big5 (Chinese), euc-kr (Korean), x-mac-cyrillic / koi8-u (Ukrainian) or windows-874 (Thai) and don't have language knowledge or ability to create samples in those languages.

The best we can say at this point is that, using these specs in this order give results more or less consistent with the truths I'm aware of and seems to make sense:

  1. https://encoding.spec.whatwg.org/#names-and-labels
  2. https://docs.rs/encoding_rs/latest/encoding_rs/#notable-differences-from-iana-naming
  3. https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
rbri commented 6 months ago

one more test case added,

will have a look at the replacement stuff (again) - but there is already a test for this and it still passes

rbri commented 6 months ago

Hi @atnak, @duonglaiquang, have done some more work on this, PR is merged and i had to do a fix afterwards because one unit test failed. https://github.com/HtmlUnit/htmlunit-neko/commit/739250dbc162ccef6a1971a609ce70f6cedcf6b5 Hope you agree with me here.

Regarding the replacement charset (https://encoding.spec.whatwg.org/#replacement) i tried to fix the handling at some more places and also some more test are added.

Hope we are getting closer :-D

atnak commented 6 months ago

@rbri Thanks for fixing up the my bug in HTMLScanner and one of the existing tests. I've looked over the commits and they look good, as does the new replacement handling in HTMLScanner. 👍