This PR fixes a wrong fullLen calculation: `feed` overwrote the running total on every call instead of adding to it, so only the most recently fed buffer was counted.

```diff
 this.feed = function(aBuf) {
-    this._mFullLen = aBuf.length;
+    this._mFullLen += aBuf.length;
```
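For context, here is a small runnable sketch of why `+=` matters once input arrives in chunks. The prober shell below is illustrative; only `_mFullLen` and `feed` come from the patched code:

```js
// Illustrative prober shell; only `_mFullLen` and `feed` are from the patch.
function Prober() {
  this._mFullLen = 0;
}

Prober.prototype.feed = function (aBuf) {
  // `+=` accumulates across feed() calls; the old `=` silently
  // discarded everything except the most recent chunk.
  this._mFullLen += aBuf.length;
};

var p = new Prober();
p.feed("Hello, ");
p.feed("world!");
console.log(p._mFullLen); // 13; with the old `=` it would wrongly be 6
```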
And the confidence function now ignores all basic ASCII characters (code <= 127), because most encodings encode them the same way, while extended characters can behave differently (such as `Ā`, whose UTF-8 bytes 0xC4 0x80 would read as `Ä€` in Windows-1252). Encoding reference: https://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=dec
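To make the idea concrete, here is a minimal sketch of an ASCII-ignoring confidence calculation. The function shape, the `numOfMBChar` parameter, and the weighting are illustrative assumptions, not the library's exact code:

```js
// Sketch of "ignore basic ASCII in confidence": only bytes > 127 enter
// the denominator. `buf` is assumed to be a binary (one char per byte)
// string; `numOfMBChar` mirrors a typical prober counter of valid
// multi-byte sequences seen. The 2x weight assumes mostly 2-byte
// sequences and is purely illustrative.
function utf8Confidence(buf, numOfMBChar) {
  var extendedBytes = 0;
  for (var i = 0; i < buf.length; i++) {
    if (buf.charCodeAt(i) > 127) {
      extendedBytes++; // basic ASCII (<= 127) is skipped entirely
    }
  }
  if (extendedBytes === 0) {
    return 0.0; // pure ASCII carries no UTF-8 signal either way
  }
  return Math.min(0.99, (numOfMBChar * 2) / extendedBytes);
}

// 1000 ASCII chars plus one valid 2-byte sequence (0xC4 0x80, i.e. "Ā"):
// confidence stays high because the ASCII run no longer dilutes the ratio.
console.log(utf8Confidence("a".repeat(1000) + "\u00C4\u0080", 1)); // 0.99
```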
Ideally, this functionality should ensure both a high UTF-8 positive rate and a low UTF-8 false positive rate.
Ignoring basic ASCII characters MAY increase confidence for a multi-byte-character document, but the problem is then a low UTF-8 positive rate rather than a high UTF-8 false positive rate. I wrote some short tests locally and found that text in other encodings like Windows-1252 is still detected correctly, the UTF-8 prober is never triggered, and the existing tests pass as is. So I think this trade-off is worth it.
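As a self-contained illustration of why the UTF-8 prober never triggers on Windows-1252 text: extended bytes in 1252 text almost never form valid UTF-8 sequences. The toy validity check below is a stand-in for the prober's real state machine, not the library's code:

```js
// Toy UTF-8 validity check over a byte array. It only verifies lead/
// continuation byte shapes (it ignores overlong forms etc.), which is
// enough to show the behavior described above.
function looksLikeUtf8(bytes) {
  for (var i = 0; i < bytes.length; ) {
    var b = bytes[i];
    if (b <= 0x7f) { i++; continue; }          // basic ASCII byte
    var trail =
      (b & 0xe0) === 0xc0 ? 1 :                // 2-byte lead
      (b & 0xf0) === 0xe0 ? 2 :                // 3-byte lead
      (b & 0xf8) === 0xf0 ? 3 : -1;            // 4-byte lead
    if (trail === -1) return false;            // stray/invalid lead byte
    for (var j = 1; j <= trail; j++) {
      if (i + j >= bytes.length || (bytes[i + j] & 0xc0) !== 0x80) {
        return false;                          // missing continuation byte
      }
    }
    i += trail + 1;
  }
  return true;
}

// "Ā" in UTF-8 is 0xC4 0x80; in Windows-1252 a byte like 0xC4 ("Ä")
// typically stands alone, so the sequence check fails.
console.log(looksLikeUtf8([0x63, 0xc4, 0x80])); // true  (valid UTF-8)
console.log(looksLikeUtf8([0x63, 0xc4, 0x41])); // false (1252-style "ÄA")
```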