Closed lingsamuel closed 4 years ago
Sorry I haven't commented on this yet, I've been a bit busy lately and thought I could do it this weekend but then life stuff came in. I'll try to check it this week or failing that, the weekend.
Hello @lingsamuel!
Thank you for fixing this. Can you, please, also include some additional test cases to make sure the fix you've made will not get broken again? :)
Added.
This sounds good to me. If I got it right, this changes the logic that guards the decrease of confidence so that it only applies when fewer than 60% of the characters are multibyte. Any particular reason for the 60%? Just a bit more than half? :D
I also made a little modification to the way the confidence is decreased: I changed `unlike *= ONE_CHAR_PROB` to `unlike *= Math.pow(ONE_CHAR_PROB, this._mNumOfMBChar);`, which is what the Python chardet does right now. This passes your 3rd test (because it increases the unlikeness there, as it has 3 chars) but not the other two, so it still makes sense to apply the logic you're suggesting here. Let's see how this goes and then we can adjust that % if needed.
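To make the combined behaviour concrete, here is a minimal sketch of how the ratio guard and the `Math.pow` change interact. This is a hypothetical reconstruction with assumed names (`getConfidence`, plain arguments instead of prober fields), not the actual jschardet source:

```javascript
// Sketch of the confidence logic discussed above (names are assumptions,
// not the real UTF8Prober implementation).
const ONE_CHAR_PROB = 0.5;

function getConfidence(numOfMBChar, fullLen) {
  let unlike = 0.99;
  // Penalize only short inputs whose multibyte ratio is below the 0.6
  // magic number discussed in this thread.
  if (numOfMBChar < 6 && numOfMBChar / fullLen < 0.6) {
    // Raise ONE_CHAR_PROB to the multibyte-char count, as Python chardet does.
    unlike *= Math.pow(ONE_CHAR_PROB, numOfMBChar);
    return 1.0 - unlike;
  }
  return 0.99;
}
```

With this shape, a short all-multibyte input (e.g. 3 of 3 chars multibyte) skips the penalty entirely, while 2 multibyte chars in a 10-char buffer still get penalized.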
I've just merged this, but since I made that additional change it created a new commit. Maybe I should have done it in a separate commit, but now it's done.
60% is a randomly picked magic number. I really don't think a fixed ratio is a good idea, but it's simple enough to implement.
As I said,

> Maybe the proper way is to ignore all ASCII characters?

I think all basic (non-extended) ASCII characters should be excluded from the ratio calculation. I know little about file encodings, but I believe many encodings encode ASCII the same way, so ASCII characters provide nothing useful during charset detection.

I have no time to implement it, though, so I'll just leave the idea here.
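The ASCII-ignoring idea above could be sketched as follows: count only non-ASCII bytes in the ratio's denominator. The function name and byte-level counting are my assumptions for illustration, not code from the PR:

```javascript
// Hypothetical sketch of the idea above: compute the multibyte-char ratio
// against only the non-ASCII portion of the buffer, since basic ASCII
// (0x00-0x7F) is encoded identically by most charsets and carries no signal.
function multibyteCharRatio(numOfMBChar, bytes) {
  let nonAsciiLen = 0;
  for (const b of bytes) {
    if (b >= 0x80) nonAsciiLen++; // skip basic ASCII bytes
  }
  // A pure-ASCII buffer has no multibyte evidence either way.
  return nonAsciiLen === 0 ? 0 : numOfMBChar / nonAsciiLen;
}
```

For example, "Hi" followed by one 3-byte UTF-8 character gives 1 multibyte char over 3 non-ASCII bytes, regardless of how much ASCII padding surrounds it.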
Ehh, I found a bug in this commit:

```javascript
this.feed = function(aBuf) {
    this._mFullLen = aBuf.length;
```

It should be:

```javascript
this.feed = function(aBuf) {
    this._mFullLen += aBuf.length; // should be plus
```

I made PR #59, which fixes this and provides basic ASCII-ignoring functionality.
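The bug matters because `feed` can be called several times with successive chunks of input; overwriting `_mFullLen` keeps only the last chunk's length. A minimal illustration (with an assumed `makeProber` helper, not the real prober):

```javascript
// Hypothetical stand-in for a prober to show why feed() must accumulate:
// assigning (=) instead of adding (+=) would leave _mFullLen equal to the
// last chunk's length rather than the total input length.
function makeProber() {
  return {
    _mFullLen: 0,
    feed(aBuf) {
      this._mFullLen += aBuf.length; // accumulate across calls
    },
  };
}

const prober = makeProber();
prober.feed("hello ");
prober.feed("world");
// prober._mFullLen is now 11, the combined length of both chunks.
```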
Hello, @aadsm!
> I also made a little modification to the way the confidence is decreased, I changed from `unlike *= ONE_CHAR_PROB` to `unlike *= Math.pow(ONE_CHAR_PROB, this._mNumOfMBChar);`, which is what the python chardet does right now

I don't understand where this change came from. In the Python version, `ONE_CHAR_PROB` is raised to the power of the multibyte character count *replacing* the loop: https://github.com/chardet/chardet/blob/4291399777cd6f011bfdab5da678ecbea8f10855/chardet/utf8prober.py#L78-L79
@rstm-sf Thank you! This was a clear oversight on my part; I forgot to remove the loop :(.
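The oversight being discussed can be shown side by side. The function names below are my own for illustration; the point is that keeping the old loop *and* adding the `Math.pow` call applies the penalty twice:

```javascript
const ONE_CHAR_PROB = 0.5;

// Buggy shape: the old per-character loop was left in alongside the new
// Math.pow call, so ONE_CHAR_PROB is effectively applied 2 * n times.
function buggyUnlike(numOfMBChar) {
  let unlike = 0.99;
  for (let i = 0; i < numOfMBChar; i++) unlike *= ONE_CHAR_PROB; // should have been removed
  unlike *= Math.pow(ONE_CHAR_PROB, numOfMBChar);
  return unlike;
}

// Fixed shape: the pow call alone, matching Python chardet's formula.
function fixedUnlike(numOfMBChar) {
  return 0.99 * Math.pow(ONE_CHAR_PROB, numOfMBChar);
}
```

For 2 multibyte chars, the buggy version yields 0.99 × 0.5⁴ instead of 0.99 × 0.5², driving `unlike` (and hence the computed confidence) off from the intended value.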
This fixes short UTF-8 text detection. Fixes #56.

I guess the magic number 6 can avoid some wrong multibyte character detections, but it actually breaks short UTF-8 text detection, so I added a simple ratio condition to it. By the way, `0.6` is another magic number...

Note: this can't fix every case: when the multibyte character ratio is too low, LatinProber gives 0.95 confidence instead. Maybe the proper way is to ignore all ASCII characters? I am not sure.

Ref: https://github.com/microsoft/vscode/issues/33720