Closed atnak closed 6 months ago
Will have a look, maybe you can provide sample html files for the encodings like the one already added...
And thanks for all your efforts
maybe you can provide sample html files for the encodings like the one already added...
Since StandardEncodingTranslator.encodingNameFromLabel()
is essentially an input string (HTTP encoding) to output string (Java encoding) convertor, I think a reasonable test would test input vs output to make sure results are expected. I've added one such test in StandardEncodingTranslatorTest.conversions()
that tests significant cases. This probably makes the sample html files I provided before redundant.
Having said that, the MS932 sample html files I provided before did show the need for the second ENCODING_TO_IANA_ENCODING
conversion step. We discovered the issue coincidentally because we're Japanese and have worked with Japanese encodings for a very long time. I cannot say the same for big5 (Chinese), euc-kr (Korean), x-mac-cyrillic / koi8-u (Ukrainian) or windows-874 (Thai) and don't have language knowledge or ability to create samples in those languages.
The best we can say at this point is that, using these specs in this order give results more or less consistent with the truths I'm aware of and seems to make sense:
one more test case added,
will have a look at the replacement stuff (again) - but there is already a test for this and it still passes
Hi @atnak, @duonglaiquang, have done some more work on this, PR is merged and i had to do a fix afterwards because one unit test failed. https://github.com/HtmlUnit/htmlunit-neko/commit/739250dbc162ccef6a1971a609ce70f6cedcf6b5 Hope you agree with me here.
Regarding the replacement charset (https://encoding.spec.whatwg.org/#replacement) i tried to fix the handling at some more places and also some more test are added.
Hope we are getting closer :-D
@rbri Thanks for fixing up the my bug in HTMLScanner and one of the existing tests. I've looked over the commits and they look good, as does the new replacement handling in HTMLScanner. 👍
This PR does the following
This PR firstly attempts to fix regressions introduced in #110 by restoring these encodings:
EncodingMap.fIANA2JavaMap
EncodingMap.fIANA2JavaMap
big5-hkscs
was not supported inEncodingMap.fIANA2JavaMap
windows-949
was not supported inEncodingMap.fIANA2JavaMap
EncodingMap.fIANA2JavaMap
(obviously)To do this, this PR removes the troublesome dependency on
EncodingMap.fIANA2JavaMap
and internally definesIANA_TO_JAVA_ENCODINGS
.In doing so, this PR also adds support to these encodings and labels that were not previously supported.
Finally, this PR attempts to fix parts of
HTMLScanner
that will be confused byStandardEncodingTranslator.encodingNameFromLabel()
returning a non-null value which is not a valid Java encoding. **** With the exception of "replacement", see following heading.
This PR does not do the following
The following appears to be an issue but I didn't attempt to fix this due to limited knowledge:
The word
"replacement"
can make its way[1] intoHTMLScanner.fJavaEncoding
even though this field seems to be used as a sanitised source[2]. I don't know how to fix this or what we want to do with "replacement" so I simply let it through[3] in this PR.[1] https://github.com/HtmlUnit/htmlunit-neko/blob/76bb1d16575d02f443c037aa54ecc7395d91b7f2/src/main/java/org/htmlunit/cyberneko/HTMLScanner.java#L3019 [2] https://github.com/HtmlUnit/htmlunit-neko/blob/76bb1d16575d02f443c037aa54ecc7395d91b7f2/src/main/java/org/htmlunit/cyberneko/HTMLScanner.java#L528-L531 [3] https://github.com/HtmlUnit/htmlunit-neko/pull/111/files#diff-c0ef5d78bafd2e2d032b106fd0ca50086966f97030bd6b6f4403d8f5ebe39f87R3011