aadsm / jschardet

Character encoding auto-detection in JavaScript (port of python's chardet)
GNU Lesser General Public License v2.1

Fix UTF-8 prober fullLen calculation, ignores basic ASCII characters #59

Closed lingsamuel closed 4 years ago

lingsamuel commented 4 years ago

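This PR fixes an incorrect fullLen calculation in the UTF-8 prober's feed function:

this.feed = function(aBuf) {
        // Bug: "=" overwrites the total on every call; it should be "+=" so the length accumulates.
        this._mFullLen = aBuf.length;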
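A minimal sketch of the accumulation fix, assuming the prober is fed several chunks in sequence. `ChunkCounter` is a hypothetical stand-in used only for illustration, not jschardet's actual prober class; only the `_mFullLen` field name is taken from the snippet above.

```javascript
// Hypothetical stand-in for a prober; it only tracks the total length fed so far.
function ChunkCounter() {
    this._mFullLen = 0;

    this.feed = function (aBuf) {
        // "+=" keeps a running total across calls.
        // With "=" the total would reflect only the most recent chunk.
        this._mFullLen += aBuf.length;
    };
}

const counter = new ChunkCounter();
counter.feed("abc");   // 3 characters
counter.feed("defgh"); // 5 more characters
console.log(counter._mFullLen); // 8
```

With plain assignment, the same two calls would leave `_mFullLen` at 5, the length of the last chunk only.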

The confidence function now also ignores all basic ASCII characters (code <= 127), because many encodings encode them identically, whereas bytes above 127 are interpreted differently from one encoding to another (for example, Ā is the two bytes 0xC4 0x80 in UTF-8, which Windows-1252 reads as Ä€).
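To illustrate why only bytes above 127 carry evidence about the encoding, here is a small sketch that splits a UTF-8 byte sequence into ASCII and non-ASCII bytes. `countBytes` is a hypothetical helper, not part of jschardet; `TextEncoder` is the standard global available in browsers and Node.js.

```javascript
// Hypothetical helper: count bytes <= 127 (basic ASCII) and bytes > 127.
function countBytes(bytes) {
    let ascii = 0;
    let nonAscii = 0;
    for (const b of bytes) {
        if (b <= 127) {
            ascii++;
        } else {
            nonAscii++;
        }
    }
    return { ascii, nonAscii };
}

// TextEncoder always produces UTF-8.
const utf8Bytes = new TextEncoder().encode("abcĀ");
console.log(Array.from(utf8Bytes)); // [97, 98, 99, 196, 128]
console.log(countBytes(utf8Bytes)); // { ascii: 3, nonAscii: 2 }
```

The three ASCII bytes would be identical in Windows-1252, so only the last two bytes (0xC4 0x80) help tell the encodings apart.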

Encoding reference: https://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=dec

Ideally, this change should:

  1. Not increase the false positive rate
  2. Increase the true positive rate

Ignoring basic ASCII characters MAY increase the confidence for a document that contains multi-byte characters. But the current problem is a low UTF-8 true positive rate rather than a high UTF-8 false positive rate. I wrote some short tests locally and found that text encoded with other encodings such as Windows-1252 is still detected correctly, the UTF-8 prober is never triggered for it, and the existing tests pass unchanged. So I think this trade-off is worth it.
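The trade-off can be sketched with a toy ratio of multi-byte characters to counted characters. This is an assumption made for illustration only: `mbCharRatio` is a hypothetical function, it is not the formula jschardet uses, and it assumes the input is valid UTF-8.

```javascript
// Hypothetical ratio: multi-byte characters divided by the characters counted.
// Assumes valid UTF-8 input; not jschardet's actual confidence formula.
function mbCharRatio(bytes, ignoreAscii) {
    let mbChars = 0;
    let counted = 0;
    for (const b of bytes) {
        if (b <= 0x7f) {
            // Basic ASCII: counted only when it is not being ignored.
            if (!ignoreAscii) counted++;
        } else if ((b & 0xc0) !== 0x80) {
            // Lead byte of a multi-byte sequence: one multi-byte character.
            mbChars++;
            counted++;
        }
        // Continuation bytes (10xxxxxx) belong to a character already counted.
    }
    return counted === 0 ? 0 : mbChars / counted;
}

const sample = new TextEncoder().encode("hello Ā");
console.log(mbCharRatio(sample, false)); // 1 multi-byte char out of 7 characters
console.log(mbCharRatio(sample, true));  // 1 multi-byte char out of 1 counted character
```

Under this toy measure, a mostly ASCII document with a single multi-byte character goes from a low ratio to the maximum once ASCII is ignored, which is the confidence increase the paragraph above warns about.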