mkronschnabl closed this issue 7 years ago
Right - I was waiting to see how long it would take for someone to encounter this problem.
The solution is to use string_decoder (http://nodejs.org/api/string_decoder.html) to parse the UTF-8 characters. This will take a bit of refactoring - can you make a pull request to add a failing test case?
I have added the test partitioned_unicode.js.
Hi @dominictarr,
I noticed this problem too.
Using Buffer.concat instead of + may solve the problem, but it could be slow. string_decoder may help, but it doesn't work for other encodings like GBK and Big5.
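A sketch of the Buffer.concat idea mentioned above: accumulate the raw chunks as Buffers and decode once at the end, instead of calling toString() per chunk and joining with +. The characters here are illustrative.

```javascript
const chunks = [
  Buffer.from([0xe4, 0xb8]),       // first two bytes of '中' (UTF-8: E4 B8 AD)
  Buffer.from([0xad, 0xe6, 0x96]), // last byte of '中' + start of '文' (E6 96 87)
  Buffer.from([0x87]),             // last byte of '文'
];

// Per-chunk decode + string concatenation corrupts the split characters:
const broken = chunks.map((c) => c.toString('utf8')).join('');

// Concatenating the Buffers first keeps every byte sequence intact:
const ok = Buffer.concat(chunks).toString('utf8');

console.log(broken); // replacement characters
console.log(ok);     // '中文'
```

The downside, as noted in the comment, is that this buffers everything before decoding, which can be slow or memory-heavy for large streams.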
I've never heard of those encodings - how widely used are they compared to Unicode?
They're not so widely used now, but they still exist in some old systems in China.
Ah okay. Hmm, there is this module: https://npmjs.org/package/iconv-lite, but I think it would need to be updated to handle multibyte encodings.
So, the simplest way would be to split the lines before you have decoded them - hopefully you don't have bytes within a multibyte encoding which == '\n'.charAt(0).
So, I think you could just use @maxogden's https://github.com/maxogden/binary-split and then decode with utf8, gbk, big5 or whatever.
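A minimal sketch of the binary-split idea: split the raw bytes on 0x0A ('\n') before decoding, so multibyte sequences are never cut at arbitrary chunk boundaries. (This is safe for UTF-8, GBK, and Big5, since none of them use 0x0A inside a multibyte sequence.) For GBK/Big5 the decode step would go through iconv-lite instead of toString; utf8 is used here so the example is self-contained. binary-split does this incrementally over a stream; this version buffers everything first for simplicity.

```javascript
// Split a list of raw chunks into decoded lines, byte-wise on 0x0A.
function splitBuffers(chunks, encoding) {
  const all = Buffer.concat(chunks);
  const lines = [];
  let start = 0;
  let idx;
  while ((idx = all.indexOf(0x0a, start)) !== -1) {
    lines.push(all.slice(start, idx).toString(encoding));
    start = idx + 1;
  }
  lines.push(all.slice(start).toString(encoding));
  return lines;
}

// 'ü\nö' arriving in chunks that cut 'ö' (0xC3 0xB6) in half:
const parts = [Buffer.from([0xc3, 0xbc, 0x0a, 0xc3]), Buffer.from([0xb6])];
const result = splitBuffers(parts, 'utf8');
console.log(result); // [ 'ü', 'ö' ]
```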
Should this be closed considering the original issue has already been fixed?
We initially work with a tar stream that generates fixed-length chunks. This can lead to partitioned Unicode characters, e.g.:
And if these chunks are used with the split library, they do not fit together properly:
At the moment we have no solution for this problem, maybe buffer concatenation can help:
Do you think there is a way to cope with this problem?
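To make the failure concrete, here is a hypothetical reproduction of the problem described above (the data and chunk size are illustrative, not the original report's): fixed-length chunks can end in the middle of a UTF-8 character, so decoding each chunk independently and joining the strings mangles it.

```javascript
const data = Buffer.from('aä\nbö\n', 'utf8'); // 8 bytes
const chunkSize = 3; // cuts 'ö' (0xC3 0xB6) in half

const chunks = [];
for (let i = 0; i < data.length; i += chunkSize) {
  chunks.push(data.slice(i, i + chunkSize));
}

// Decoding each chunk independently, then joining the strings:
const joined = chunks.map((c) => c.toString('utf8')).join('');
console.log(joined === data.toString('utf8')); // false: 'ö' was mangled
```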