ASCIIPropertyListParser: handle non-7b-ASCII chars

matvore commented 6 years ago

Currently, ASCIIPropertyListParser takes a byte[] and widens each byte to a UTF-16 char by padding it with an extra 0x00 byte. If the byte is >= 0x80, however, it gets padded with 0xff instead. This means that if the bytes are in the 7-bit ASCII range, everything is fine. But if not, 0x80 for example becomes 0xff80 (half-width katakana TA), which I don't believe corresponds to any real encoding system.
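The padding behavior described above is what Java's byte-to-char cast does: the byte is sign-extended before it is narrowed to a char. The snippet below is a minimal sketch to reproduce it. The class and method names are made up for illustration and are not taken from dd-plist.

```java
public class SignExtensionDemo {
    // Plain cast: a byte is signed, so values >= 0x80 are negative and
    // sign-extend to 0xFFxx when converted to char.
    static char naive(byte b) {
        return (char) b;
    }

    // Masking with 0xFF keeps only the low 8 bits, so 0x80 maps to U+0080.
    // This avoids the sign extension but still assumes one byte per
    // character, so it is not a real decoding step either.
    static char masked(byte b) {
        return (char) (b & 0xFF);
    }

    public static void main(String[] args) {
        byte b = (byte) 0x80;
        System.out.printf("naive:  U+%04X%n", (int) naive(b));  // U+FF80
        System.out.printf("masked: U+%04X%n", (int) masked(b)); // U+0080
    }
}
```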

The options are to:

convert using the default system encoding
convert using UTF-8

I think UTF-8 is a better default. The default system encoding would be good for backwards compatibility, but since this feature (non-7-bit ASCII) has never worked at all before, backwards compatibility isn't really a concern. This can also be made configurable if the need presents itself.
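For reference, the two options differ only in which Charset is handed to the String constructor. This is a rough sketch using standard JDK calls; the class and method names are hypothetical and not part of the dd-plist API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeOptions {
    // Option 1: the platform default, which varies between machines.
    static String decodeWithDefault(byte[] data) {
        return new String(data, Charset.defaultCharset());
    }

    // Option 2: always UTF-8, which gives the same result everywhere
    // and decodes 7-bit ASCII input unchanged.
    static String decodeAsUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "café" encoded as UTF-8: the é takes two bytes (0xC3 0xA9).
        byte[] data = {0x63, 0x61, 0x66, (byte) 0xC3, (byte) 0xA9};
        System.out.println(decodeAsUtf8(data));
        // The result of this call depends on the platform settings.
        System.out.println(decodeWithDefault(data));
    }
}
```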

3breadt commented 6 years ago

I didn't know char casting did that; it was not intended behavior.

So I redesigned the approach for parsing ASCII property lists. The parser now works on a char array instead of a byte array. An encoding can be specified explicitly; otherwise the parser attempts to detect it (UTF-8, UTF-16, UTF-32 or ASCII). I created a feature branch for this reworked parser: https://github.com/3breadt/dd-plist/tree/asciipropertylist-configurable-encoding
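To illustrate the idea, here is a rough sketch of one way to do this kind of detection. It is not the code from the feature branch. It assumes detection is based on a byte order mark (BOM) and falls back to UTF-8, which also covers plain ASCII, when no BOM is present. All names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {
    // True if data begins with the given unsigned byte values.
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    // Assumption: detection relies on the BOM only.
    static Charset detect(byte[] data) {
        // Check UTF-32 first: its little-endian BOM (FF FE 00 00)
        // starts with the UTF-16LE BOM (FF FE).
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return Charset.forName("UTF-32BE");
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return Charset.forName("UTF-32LE");
        if (startsWith(data, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(data, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        // UTF-8 BOM or no BOM at all: UTF-8 is a superset of ASCII.
        return StandardCharsets.UTF_8;
    }

    // Decodes to a char array, using the explicit charset when one is given.
    static char[] decode(byte[] data, Charset explicit) {
        Charset cs = (explicit != null) ? explicit : detect(data);
        String text = new String(data, cs);
        // Drop a decoded BOM so it does not reach the parser.
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            text = text.substring(1);
        }
        return text.toCharArray();
    }

    public static void main(String[] args) {
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 0x7B, 0x00, 0x7D, 0x00};
        System.out.println(detect(utf16le));                 // UTF-16LE
        System.out.println(new String(decode(utf16le, null))); // {}
    }
}
```

A BOM-only check cannot detect UTF-16 or UTF-32 input that lacks a BOM, so a real parser would probably need extra heuristics or the explicit encoding parameter.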

What do you think?

matvore commented 6 years ago

That's great! That commit would definitely fit my requirements.

3breadt / dd-plist

ASCIIPropertyListParser: handle non-7b-ASCII chars #47