dominictarr / split

MIT License
347 stars 39 forks

Problem with partitioned unicode characters #5

Closed mkronschnabl closed 7 years ago

mkronschnabl commented 10 years ago

We initially work with a tar stream that generates fixed-length chunks. This can lead to partitioned Unicode characters, e.g.:

var buffer, complete, piece1, piece2, pieces, soFar, testBuffer;

testBuffer = new Buffer('テスト試験今日はとても,よい天気で');

piece1 = testBuffer.slice(0, 20);
piece2 = testBuffer.slice(20, testBuffer.length);

// 日 is partitioned
console.log(piece1.toString()); // テスト試験今��
console.log(piece2.toString()); // �はとても,よい天気で

And if these chunks are used with the split library, they do not fit together properly:

soFar = piece1.toString();
buffer = piece2;

pieces = (soFar + buffer).split(/,/g);

// the resulting unicode string is corrupt due to the string concatenation
console.log(pieces); // [ 'テスト試験今���はとても', 'よい天気で' ]

At the moment we have no solution for this problem; maybe buffer concatenation can help:

complete = Buffer.concat([piece1, piece2]);
console.log(complete.toString()); // テスト試験今日はとても,よい天気で

Do you think there is a way to cope with this problem?

dominictarr commented 10 years ago

Right - I was waiting to see how long it would take for someone to encounter this problem.

The solution is to use string_decoder (http://nodejs.org/api/string_decoder.html) to parse the UTF-8 characters. This will take a bit of refactoring - can you make a pull request to add a failing test case?

mkronschnabl commented 10 years ago

I have added the test partitioned_unicode.js.

xingrz commented 10 years ago

Hi @dominictarr ,

I noticed this problem too.

Using Buffer.concat instead of + may solve the problem, but it could be slow.

string_decoder may help, but it doesn't work for other encodings like GBK and Big5.

dominictarr commented 10 years ago

I've never heard of those encodings. How widely used are they compared to Unicode?

xingrz commented 10 years ago

They're not so widely used now, but they still exist in some old systems in China.

dominictarr commented 10 years ago

Ah okay. Hmm, there is this module: https://npmjs.org/package/iconv-lite but I think it would need to be updated to handle multibyte encodings.

So, the simplest way would be to split the lines before you have decoded them; hopefully you don't have bytes within a multibyte encoding that equal '\n'.charAt(0).

So, I think you could just use @maxogden's https://github.com/maxogden/binary-split and then decode with utf8, gbk, big5 or whatever.
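For UTF-8 this byte-level approach is in fact safe: the newline byte 0x0A can never appear inside a multibyte sequence, because all continuation bytes are >= 0x80. A minimal sketch of split-then-decode, assuming fixed chunks as in the original report (`splitBuffers` is a hypothetical helper, not binary-split's actual API):

```javascript
// Split a sequence of Buffer chunks on a raw byte (default 0x0A, '\n')
// *before* decoding, so multibyte characters that straddle chunk
// boundaries are reassembled inside each line.
function splitBuffers(chunks, byte = 0x0a) {
  let pending = Buffer.alloc(0);
  const lines = [];
  for (const chunk of chunks) {
    pending = Buffer.concat([pending, chunk]);
    let idx;
    while ((idx = pending.indexOf(byte)) !== -1) {
      lines.push(pending.slice(0, idx));
      pending = pending.slice(idx + 1);
    }
  }
  if (pending.length) lines.push(pending); // trailing line without newline
  return lines;
}

// 日 is cut in half at byte 20, but the byte-level split reunites it.
const full = Buffer.from('テスト試験今日はとても\nよい天気で');
const lines = splitBuffers([full.slice(0, 20), full.slice(20)]);
console.log(lines.map(b => b.toString('utf8')));
// [ 'テスト試験今日はとても', 'よい天気で' ]
```

Each line buffer could then be decoded with UTF-8, GBK, Big5, or whatever, e.g. via iconv-lite.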

frosas commented 7 years ago

Should this be closed considering the original issue has already been fixed?