ashtuchkin / iconv-lite

Convert character encodings in pure javascript.
MIT License
3.04k stars 282 forks source link

fix: completing the InternalEncoder for utf8 #282

Closed yosion-p closed 2 years ago

yosion-p commented 2 years ago

Hi bro,It's me, again.🤣 I have completed the InternalEncoder for utf8 with surrogates after I read what you meant(#275 ).

275 looked a bit messy, so I submitted a brand new one,

I didn't know much before, but now I know more. I think surrogates should be in pairs always. Thank you for your patience.

yosion-p commented 2 years ago

I think I got it,

the merged output of iconv-lite encoding must be equal to Node's encoding of the whole string (Buffer.from(S, "utf8")).

sorry,I thought incomplete surrogate pairs were invalid and could be filtered out. I will correct all outputs.

But about that:

a loop over the whole string is pretty inefficient. Moreover, you don't need to touch internal surrogates - Node processes them correctly already. We only need to adjust boundaries, as these are what Node can't adjust as it doesn't know about them.

I don't really understand, Because if I don't loop through the string, I won't know which character in the string needs to be processed.

yosion-p commented 2 years ago

image

I was wondering if there was something wrong with the test method,not Encoder.

Just like I thought before, If UT into multiple strings in the input,IconvLiteEncoderStream.prototype._transform will be invoked many times, back to characters be cut by Encoder rather than a complete surrogates pair.

If we modify InternalEncoder, When the surrogates pair was cut, The first call doesn't return anything, only the second one returns a full surrogates pair,

That's why I need to go through a loop and check for characters one by one to see if it is a surrogate.

yosion-p commented 2 years ago

I think I can solve this problem in InternalEncoderUtf8.prototype.end(), I will try.

yosion-p commented 2 years ago

Hi bro,It's me, again.🤣 I did it by modifying the code of InternalEncoderUtf8.prototype.write. In a word, less is more.

yosion-p commented 2 years ago

have a look😃

yosion-p commented 2 years ago

Aha i tried to make it short, but didn't work it out.

nit: typo in work "surrogates"

yosion-p commented 2 years ago

Almost there, thank you for persistence!

Victory lies ahead!

yosion-p commented 2 years ago

Hi Buddy,

I see all the tests have the same structure ....

Now i realize actually you made it very clear, it's me misunderstand it, I'm terribly sorry…

make sure the function does not throw if str is empty.

About what you said, I have my own interpretation, I added one condition judgment was also in the first line of InternalEncoderUtf8.prototype.write , then I removed.

I think we still need to consider the incoming string is empty, as internal. js line: 81 InternalEncoder.prototype.write, otherwise UT will report an error, like I just submitted for the first time.

Here's what I thought I'd do:

if (str) {
    if (this.lowSurrogate) {
        str = this.lowSurrogate + str;
        this.lowSurrogate = '';
    }

    var charCode = str.charCodeAt(str.length - 1);
    if (55296 < charCode && charCode <= 56319) {
        this.lowSurrogate = str[str.length - 1];
        str = str.slice(0, str.length - 1);
    }
}
return Buffer.from(str, this.enc);

But it has nested conditional judgments, so, I did something like this:

if (!str) return Buffer.from('', this.enc);;

if (this.lowSurrogate) {
    str = this.lowSurrogate + str;
    this.lowSurrogate = '';
}

var charCode = str.charCodeAt(str.length - 1);
if (55296 < charCode && charCode <= 56319) {
    this.lowSurrogate = str[str.length - 1];
    str = str.slice(0, str.length - 1);
}

return Buffer.from(str, this.enc);

Although there are two Buffer.from in it.

ashtuchkin commented 2 years ago

I agree, we need to return an empty buffer, not undefined, when str is empty.

I've made some final minor adjustments (e.g. it's actually a high surrogate, not low :) and will merge. Thank you for your contribution!

yosion-p commented 2 years ago

Thank you buddy! It's a milestone for me the first time my code got merged in an open source project.

Even though my pointcut and idea are right since the beginning, but the process was kind of tortuous.

Maybe I need to pay more attention to the format and performance of the code,

and try to get your full point.lol

In short, thanks a lot for your patience and encouragement, respect!

Best Regards, Yosion.

At 2021-09-19 01:55:57, "Alexander Shtuchkin" @.***> wrote:

I agree, we need to return an empty buffer, not undefined, when str is empty.

I've made some final minor adjustments (e.g. it's actually a high surrogate, not low :) and will merge. Thank you for your contribution!

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android.

ashtuchkin commented 2 years ago

You did great, keep it up! It all comes with practice, don't worry too much.