ashtuchkin / iconv-lite

Convert character encodings in pure javascript.
MIT License
3.07k stars 282 forks source link

Re-check internal codec's "trivial" encoder on surrogates #250

Closed ashtuchkin closed 3 years ago

ashtuchkin commented 4 years ago

I have a suspicion that it might not be so trivial. Test with 2 string chunks, where the first one ends on a high surrogate and the second begins with low surrogate.

ashtuchkin commented 4 years ago

in streams-test.js

    it("Encoding using internal modules: utf8 with surrogates in separate chunks", checkEncodeStream({
        encoding: "utf8",
        input: ["\uD83D", "\uDE3B"],
        output: "f09f98bb",
    }));
yosion-p commented 3 years ago

I suppose the input method of high surrogate and low surrogate is not supported by iconv-lite yet, right?Should i regard it as a new feat to develop?

ashtuchkin commented 3 years ago

Only for UTF-8 (as we're using Node's native utf-8 encoder). All the main encodings do support surrogate pairs in both encoding and decoding methods.

yosion-p commented 3 years ago

I thought that high surrogate and low surrogate were only used in utf-16.

ashtuchkin commented 3 years ago

They are. Javascript strings are internally utf-16, so when you encode a stream of js strings to e.g. a stream of utf-8 encoded buffers AND it so happens that the surrogates are split between chunks, then it could lead to an incorrect result. To be fair, this issue is not iconv-lite-specific, node itself does this too.

On Mon, Aug 9, 2021 at 2:58 AM yosion-p @.***> wrote:

I thought that high surrogate and low surrogate were only used in utf-16.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ashtuchkin/iconv-lite/issues/250#issuecomment-894995794, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEZKHONHSIUMDHWLQKB4VLT354APANCNFSM4OKLFU2Q .

yosion-p commented 3 years ago

do you have any ideas to solve this issue? will it be more troublesome? after all, this may involve the underlying logic of nodejs.

ashtuchkin commented 3 years ago

I'd have to either reimplement utf-8 codec or add some logic to check the chunk boundaries before encoding. Should not be too hard, but I haven't got any community interest before, so focused on other things. Also its likely to cause a perf hit, so not clear if it'd be net positive.

Why are you interested btw?

On Tue, Aug 10, 2021, 04:40 yosion-p @.***> wrote:

do you have any ideas to solve this issue? will it be more troublesome? after all, this may involve the underlying logic of nodejs.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ashtuchkin/iconv-lite/issues/250#issuecomment-895843546, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEZKHKDHV4CXIOW3VG7UF3T4DQWHANCNFSM4OKLFU2Q .

yosion-p commented 3 years ago

I found that the GitHub community is an interesting thing when I submitted the code on GitHub, I will make some contribution to it when I'm not busy.

yosion-p commented 3 years ago

I think the main point to resolve this issue is i'm not sure if it's surrogate pair, or even unsure what kind of coding type.
I wonder if there are two solutions: A. To add a parameter to describe the code type, but it would be "breaking change" B. Can we judge and focus the input code is utf-8 or utf-16 surrogate pair, then it will be much easier. But the challenge is how to judge. I wonder if it could be by checking the chartable, it will lead to perf hit certainly.

I'd have to either reimplement utf-8 codec or add some logic to check the chunk boundaries before encoding. Should not be too hard, but I haven't got any community interest before, so focused on other things. Also its likely to cause a perf hit, so not clear if it'd be net positive. Why are you interested btw? On Tue, Aug 10, 2021, 04:40 yosion-p @.***> wrote: do you have any ideas to solve this issue? will it be more troublesome? after all, this may involve the underlying logic of nodejs. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#250 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEZKHKDHV4CXIOW3VG7UF3T4DQWHANCNFSM4OKLFU2Q .

ashtuchkin commented 3 years ago

I'm not sure what you're talking about here. JS strings are always UTF-16, no need to guess the encoding.

yosion-p commented 3 years ago

Sorry brother, i have ignored it, shamed... BTW, i appreciate very much that you make the point clear to me.

yosion-p commented 3 years ago

I checked some information about the boundary value of surrogate pair Now, my code works for this test case.

275

add some logic to check the chunk boundaries before encoding.