albfernandez / juniversalchardet

Originally exported from code.google.com/p/juniversalchardet

US-ASCII is too generous #33

Closed: nedtwigg closed this issue 4 years ago

nedtwigg commented 4 years ago

Great library! I've used 2.3.0 for a long time, just upgraded to 2.3.1. At least in my test suites, it caused some pretty big regressions. Some examples:

| input bytes | 2.3.0 | 2.3.1 |
| --- | --- | --- |
| `"ab".getBytes(UTF-16LE)` | UTF-16LE | US-ASCII |
| `"ab".getBytes(UTF-16BE)` | UTF-16BE | US-ASCII |
| `new byte[]{0x50, 0x4b, 0x03, 0x04, 0x14, 0x00, 0x02, 0x00}` (zip header) | null | US-ASCII |
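
For context, a minimal harness along these lines (a sketch assuming the usual `UniversalDetector` API of `handleData`, `dataEnd` and `getDetectedCharset`; the actual test code is not shown in this thread) reproduces this kind of comparison:

```java
import java.nio.charset.StandardCharsets;

import org.mozilla.universalchardet.UniversalDetector;

public class DetectExamples {

    // Feed a byte array through the detector and return the guessed charset name (or null).
    static String detect(byte[] data) {
        UniversalDetector detector = new UniversalDetector(null);
        detector.handleData(data, 0, data.length);
        detector.dataEnd();
        return detector.getDetectedCharset();
    }

    public static void main(String[] args) {
        System.out.println(detect("ab".getBytes(StandardCharsets.UTF_16LE)));
        System.out.println(detect("ab".getBytes(StandardCharsets.UTF_16BE)));
        // first bytes of a zip local file header
        System.out.println(detect(new byte[]{0x50, 0x4b, 0x03, 0x04, 0x14, 0x00, 0x02, 0x00}));
    }
}
```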

These are corner cases, and it's an inherently imprecise problem, no worries if you decide not to support them. I'm very happy with 2.3.0 as is!

A point on semver: adding a new feature is always riskier than "something was broken, so we fixed it". A nice thing about semver is not just marking breaking changes in the major version, but also signalling risk tolerance in the minor and bugfix versions. I think it would have been better to release 2.3.1 as 2.4.0, since it added a significant new feature which (in my case) happened to break some things. That isn't too surprising for a significant new feature, but it surprised me because I had assumed the release was purely a fix.

albfernandez commented 4 years ago

The idea was to return "US-ASCII" in cases where we were returning null and only ASCII characters were in the input. For this reason it was only a minor release... but the implementation had a bug and accepted all possible (and usually unused) ASCII chars, from 0 to 127. Now we check that the input contains only printable ASCII chars, newlines and tabs.
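
A sketch of that kind of check (a hypothetical helper, not the library's actual implementation) could look like:

```java
final class AsciiCheck {

    // Hypothetical helper: accept only printable ASCII (0x20-0x7E)
    // plus tab, newline and carriage return.
    static boolean isPlainAscii(byte[] data) {
        for (byte b : data) {
            int c = b & 0xFF;
            if ((c < 0x20 || c > 0x7E) && c != '\t' && c != '\n' && c != '\r') {
                return false;
            }
        }
        return true;
    }
}
```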

It works for your zip header example, but for the UTF-16 inputs it returns null. I think that's reasonable because there are only 4 bytes. Can you provide the sample code you used to get these results in 2.3.0?

If you have some tests we can incorporate into juniversalchardet, your pull requests are welcome.

nedtwigg commented 4 years ago

Sounds good! There is a lot of code between juniversalchardet and my test cases that broke, and it's possible the detection for very short inputs happens on our side; returning null for them seems reasonable (better than ASCII!). We had a ton of tests break; I just reported the ones that made the point most easily.

Am I correct that the intent was: "if we think the input is an ASCII-compatible one-byte-per-character encoding (e.g. UTF-8, ISO-8859), and every character we've seen so far is plain ASCII, then rather than guess one ASCII-compatible encoding over another, we report US-ASCII, and the user can decide to always treat ASCII as UTF-8 or always treat ASCII as ISO-8859-1"?
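
If that's the intent, the caller-side policy might look something like this (a sketch with a hypothetical `choose` helper, assuming the detector result is just a charset name or null):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetPolicy {

    // Hypothetical caller-side policy: when the detector says "US-ASCII",
    // decode with whichever ASCII-compatible charset the application prefers.
    static Charset choose(String detected) {
        if (detected == null) {
            return StandardCharsets.UTF_8;   // fallback when nothing was detected
        }
        if ("US-ASCII".equals(detected)) {
            return StandardCharsets.UTF_8;   // or ISO-8859-1, depending on the app
        }
        return Charset.forName(detected);
    }
}
```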

> The idea was... For this reason it was only a minor release... but the implementation has a bug

That's exactly the risk-tolerance reasoning for bumping `added` in `breaking.added.fixed`, unless the changes are limited to "we know x is broken, and the only point of the change is to fix this known-broken thing." Not a big deal, very thankful for your project :)

albfernandez commented 4 years ago

I've released the 2.3.2 version with the changes. Please check if it works for you.

nedtwigg commented 4 years ago

Works great! I like the new behavior. It did break a few assumptions our code made before, but having fixed those, the end result with 2.3.2 is better than what we could achieve with 2.3.0. Thanks very much!