aadsm / jschardet

Character encoding auto-detection in JavaScript (port of python's chardet)
GNU Lesser General Public License v2.1

Multi-byte character ratio detection in UTF-8 prober confidence function #57

Closed: lingsamuel closed this 4 years ago

lingsamuel commented 4 years ago

This fixes short UTF-8 text detection. Fixes #56. I guess the magic number 6 is meant to avoid some false multibyte character detections, but it actually breaks detection of short UTF-8 text. So I added a simple ratio condition to it. That said, 0.6 is another magic number...
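
Roughly, the idea looks like this in getConfidence (a simplified sketch, not the exact diff; exactly how the ratio is computed from _mNumOfMBChar and _mFullLen can be tweaked):

    this.getConfidence = function() {
        var unlike = 0.99;
        // Estimate how much of the input is made of multi-byte sequences
        // (CJK characters are typically 3 bytes each in UTF-8).
        var mbRatio = this._mFullLen > 0 ?
            (this._mNumOfMBChar * 3) / this._mFullLen : 0;
        // Only lower the confidence when there are few multi-byte characters
        // AND they are a small fraction of the input; short pure-UTF-8
        // strings then keep the full 0.99.
        if (this._mNumOfMBChar < 6 && mbRatio < 0.6) {
            for (var i = 0; i < this._mNumOfMBChar; i++) {
                unlike = unlike * ONE_CHAR_PROB;
            }
            return 1.0 - unlike;
        }
        return unlike;
    };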

Note: this can't fix a case like the following:

    const UTF8Prober = require('./src/utf8prober');
    function utf8test(str) {
        const prober = new UTF8Prober();
        prober.feed(Buffer.from(str).toString('binary'));
        console.log(prober.getConfidence());
    }
    utf8test("aaaaaaaaaaaaaaaaaaaaaaaaaaaa中文");
    // confidence is 0.7525

because the multibyte character ratio is too low. LatinProber gives a confidence of 0.95 instead, so it still wins.

Maybe the proper way is to ignore all ASCII characters? I am not sure.

Ref: https://github.com/microsoft/vscode/issues/33720

aadsm commented 4 years ago

Sorry I haven't commented on this yet; I've been a bit busy lately and thought I could get to it this weekend, but then life stuff came up. I'll try to check it this week or, failing that, over the weekend.

gyzerok commented 4 years ago

Hello @lingsamuel!

Thank you for fixing this. Can you please also include some additional test cases, to make sure the fix you've made won't get broken again? :)

lingsamuel commented 4 years ago

Added.

aadsm commented 4 years ago

This sounds good to me. If I got it right, this changes the guard on the confidence decrease so that it only applies when less than 60% of the characters are multibyte. Any particular reason for the 60%? Just a bit more than half? :D

I also made a little modification to the way the confidence is decreased: I changed from unlike *= ONE_CHAR_PROB to unlike *= Math.pow(ONE_CHAR_PROB, this._mNumOfMBChar), which is what the Python chardet does right now. This passes your 3rd test (because it increases the unlikeness there, as it has 3 chars) but not the other two, so it still makes sense to apply the logic you're suggesting here. Let's see how this goes and then we can adjust that % if needed.

I've just merged this, but since I made that additional change it created a new commit. Maybe I should have done it in a separate commit, but now it's done.

lingsamuel commented 4 years ago

60% is a randomly picked magic number. I really don't think a fixed ratio threshold is a good idea, but it's simple enough to implement.

As I said,

> Maybe the proper way is to ignore all ASCII characters?

I think all basic (non-extended) ASCII characters should be excluded from the ratio calculation. I don't know much about file encodings, but I believe many encodings encode ASCII the same way, so those characters provide nothing useful for charset detection.

But I don't have time to implement it, so I'll just leave the idea here.
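
Something along these lines is what I mean (just a sketch of the idea; _mNonAsciiLen is a made-up counter name):

    this.feed = function(aBuf) {
        for (var i = 0; i < aBuf.length; i++) {
            // Bytes 0x00-0x7F are plain ASCII and are encoded identically in
            // most charsets, so they say nothing about the encoding; skip
            // them when accumulating the length used for the ratio.
            if (aBuf.charCodeAt(i) > 0x7f) {
                this._mNonAsciiLen++;
            }
        }
        // ... the rest of feed() (state machine, _mNumOfMBChar counting)
        // stays as it is; getConfidence() would then divide by _mNonAsciiLen
        // instead of the full buffer length.
    };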

lingsamuel commented 4 years ago

Ehh, I found a bug in this commit:

    this.feed = function(aBuf) {
        this._mFullLen = aBuf.length;

It should be:

    this.feed = function(aBuf) {
        this._mFullLen += aBuf.length; // should be plus

I made PR #59, which fixes this and adds basic ASCII-ignoring functionality.
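
To see why the += matters (hypothetical usage, assuming the prober is fed the input in several chunks):

    const UTF8Prober = require('./src/utf8prober');
    const prober = new UTF8Prober();
    prober.feed(Buffer.from("中").toString('binary')); // 3 bytes
    prober.feed(Buffer.from("文").toString('binary')); // 3 more bytes
    // With "=" , _mFullLen ends up as 3 (only the last chunk), so the
    // multi-byte ratio is computed against the wrong total.
    // With "+=", _mFullLen is 6, the total amount of data seen so far.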

rstm-sf commented 3 years ago

Hello, @aadsm!

> I also made a little modification to the way the confidence is decreased, I changed from unlike *= ONE_CHAR_PROB to unlike *= Math.pow(ONE_CHAR_PROB, this._mNumOfMBChar), which is what the Python chardet does right now

I don't understand where this change came from. In the Python version, ONE_CHAR_PROB is raised to the power of this._mNumOfMBChar, which replaces the loop entirely: https://github.com/chardet/chardet/blob/4291399777cd6f011bfdab5da678ecbea8f10855/chardet/utf8prober.py#L78-L79
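
In other words, the relevant part of getConfidence (simplified, using the prober's existing names):

    var unlike = 0.99;
    // What was merged keeps the old loop, so the Math.pow penalty is applied
    // _mNumOfMBChar times, i.e. ONE_CHAR_PROB ends up raised to
    // _mNumOfMBChar squared:
    //
    //     for (var i = 0; i < this._mNumOfMBChar; i++) {
    //         unlike *= Math.pow(ONE_CHAR_PROB, this._mNumOfMBChar);
    //     }
    //
    // The loop-free equivalent of the Python code is a single statement:
    unlike *= Math.pow(ONE_CHAR_PROB, this._mNumOfMBChar);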

aadsm commented 3 years ago

@rstm-sf Thank you! This was a clear oversight on my part; I forgot to remove the loop :(.