Closed ashtuchkin closed 3 years ago
in streams-test.js
it("Encoding using internal modules: utf8 with surrogates in separate chunks", checkEncodeStream({
encoding: "utf8",
input: ["\uD83D", "\uDE3B"],
output: "f09f98bb",
}));
I suppose the input method of high surrogate and low surrogate is not supported by iconv-lite yet, right?Should i regard it as a new feat to develop?
Only for UTF-8 (as we're using Node's native utf-8 encoder). All the main encodings do support surrogate pairs in both encoding and decoding methods.
I thought that high surrogate and low surrogate were only used in utf-16.
They are. Javascript strings are internally utf-16, so when you encode a stream of js strings to e.g. a stream of utf-8 encoded buffers AND it so happens that the surrogates are split between chunks, then it could lead to an incorrect result. To be fair, this issue is not iconv-lite-specific, node itself does this too.
On Mon, Aug 9, 2021 at 2:58 AM yosion-p @.***> wrote:
I thought that high surrogate and low surrogate were only used in utf-16.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ashtuchkin/iconv-lite/issues/250#issuecomment-894995794, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEZKHONHSIUMDHWLQKB4VLT354APANCNFSM4OKLFU2Q .
do you have any ideas to solve this issue? will it be more troublesome? after all, this may involve the underlying logic of nodejs.
I'd have to either reimplement utf-8 codec or add some logic to check the chunk boundaries before encoding. Should not be too hard, but I haven't got any community interest before, so focused on other things. Also its likely to cause a perf hit, so not clear if it'd be net positive.
Why are you interested btw?
On Tue, Aug 10, 2021, 04:40 yosion-p @.***> wrote:
do you have any ideas to solve this issue? will it be more troublesome? after all, this may involve the underlying logic of nodejs.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ashtuchkin/iconv-lite/issues/250#issuecomment-895843546, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEZKHKDHV4CXIOW3VG7UF3T4DQWHANCNFSM4OKLFU2Q .
I found that the GitHub community is an interesting thing when I submitted the code on GitHub, I will make some contribution to it when I'm not busy.
I think the main point to resolve this issue is i'm not sure if it's surrogate pair, or even unsure what kind of coding type.
I wonder if there are two solutions:
A. To add a parameter to describe the code type, but it would be "breaking change"
B. Can we judge and focus the input code is utf-8 or utf-16 surrogate pair, then it will be much easier. But the challenge is how to judge. I wonder if it could be by checking the chartable, it will lead to perf hit certainly.
I'd have to either reimplement utf-8 codec or add some logic to check the chunk boundaries before encoding. Should not be too hard, but I haven't got any community interest before, so focused on other things. Also its likely to cause a perf hit, so not clear if it'd be net positive. Why are you interested btw? … On Tue, Aug 10, 2021, 04:40 yosion-p @.***> wrote: do you have any ideas to solve this issue? will it be more troublesome? after all, this may involve the underlying logic of nodejs. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#250 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEZKHKDHV4CXIOW3VG7UF3T4DQWHANCNFSM4OKLFU2Q .
I'm not sure what you're talking about here. JS strings are always UTF-16, no need to guess the encoding.
Sorry brother, i have ignored it, shamed... BTW, i appreciate very much that you make the point clear to me.
I checked some information about the boundary value of surrogate pair Now, my code works for this test case.
add some logic to check the chunk boundaries before encoding.
I have a suspicion that it might not be so trivial. Test with 2 string chunks, where the first one ends on a high surrogate and the second begins with low surrogate.