101arrowz / fflate

High performance (de)compression in an 8kB package
https://101arrowz.github.io/fflate
MIT License
2.27k stars 79 forks source link

Broken surrogate pairs can't be converted back and forth with strToU8 -> strFromU8 #83

Closed fabiospampinato closed 3 years ago

fabiospampinato commented 3 years ago

How to reproduce

The following assertion fails:

const fflate = require ( 'fflate' );

const char = String.fromCodePoint ( 55296 );

console.assert ( fflate.strFromU8 ( fflate.strToU8 ( char ) ) === char );

The problem

I'm not sure if this is a bug exactly, I think ideally any string should be able to go through both strToU8 and then strFromU8 and exit unchanged, but the problem with the string I picked is that it comprises of a single high surrogate character, which isn't actually valid UTF-16 as there must be a low surrogate character immediately after that in the string, but JS doesn't throw so you can actually have strings like that.

fabiospampinato commented 3 years ago

Another potential bug:

const fflate = require ( 'fflate' );

const char = '\ufeff';

console.assert ( fflate.strFromU8 ( fflate.strToU8 ( char ) ) === char );

TextEncoder/TextDecoder work weirdly, I have no idea why they can't encode and decode \ufeff properly, that's not even a surrogate character.

fabiospampinato commented 3 years ago

It seems that the function used for converting uint8s to strings in Node works for \ufeff (https://github.com/feross/buffer/blob/795bbb5bda1b39f1370ebd784bea6107b087e3a7/index.js#L954), but it doesn't work (if that's actually an issue), with the single surrogate character.

101arrowz commented 3 years ago

Having a single surrogate character without a pair cannot yield valid UTF-8, but strToU8 guarantees valid UTF-8 so there isn't much to be done in this edge case. I guess I could clarify, but only valid UTF-16 input should be provided into the functions. Also, the TextDecoder thing works if you use new TextDecoder('utf-8', {ignoreBOM: true}), since \ufeff has a special meaning in UTF-8 when it's the first character: it specifies the byte order (little or big endian). It's called the byte order marker, or BOM, so only if you ignore it will it be visible. (This works only in Chromium AFAIK)

I'm not sure if there's anything that can really be "fixed" here since the main bug is dependent upon the input being invalid UTF-16, which should ideally throw an error but due to performance restrictions that can't be done easily.

fabiospampinato commented 3 years ago

I see, I'm closing this then.

Thanks for the mention of the ignoreBOM option, I totally overlooked options of TextDecoder.

which should ideally throw an error but due to performance restrictions that can't be done easily.

Yeah it's probably not worth it to make it slower for everyone just to potentially fail with a clean error if somebody uses that with invalid UTF-16, which will basically never happen anyway.