Fails to parse byte-order mark in certain situations

chyh1990 / yaml-rust

A pure rust YAML implementation.

Apache License 2.0

600 stars 139 forks source link

Fails to parse byte-order mark in certain situations #142

Open hexane360 opened 4 years ago

hexane360 commented 4 years ago

In certain situations, yaml-rust fails to handle strings starting with a UTF-8 encoded byte order mark.

Plenty of software (such as notepad.exe) inserts BOMs automatically, so I think it's something yaml-rust should be aware of.

Currently, yaml-rust scans like this:

"\u{feff}test: 5"        -> Scalar("\u{feff}test") Scalar("5")
"\u{feff}\n---\ntest: 5" -> Scalar("\u{feff}") DocumentStart Scalar("test") Scalar("5")
"\u{feff}---\ntest: 5"   -> ScanError("mapping values are not allowed in this context")

All but the last one parse successfully.

I'm not sure exactly what the correct behavior is here, but I don't think we can push the burden of sanitization onto users. Thoughts?

mkmik commented 4 years ago

The YAML spec, section 9.1.1 states:

A document may be preceded by a prefix specifying the character encoding, and optional comment lines. Note that all documents in a stream must use the same character encoding. However it is valid to re-specify the encoding using a byte order mark for each document in the stream. This makes it easier to concatenate streams.

The existence of the optional prefix does not necessarily indicate the existence of an actual document.

I think yaml-rust should eat up and ignore the \u{feff} codepoint at the beginning of each document (and not in other places).

I noticed this issue while evaluating yaml-rust; see demo https://github.com/mkmik/rustyamltest which contains code that reproduces the BOM issue clearly.

hexane360 commented 4 years ago

Note also section 5.2.

On input, a YAML processor must support the UTF-8 and UTF-16 character encodings.

Byte 0 Byte 1 Byte 2 Byte 3 Encoding

Explicit BOM #x00 #x00 #xFE #xFF UTF-32BE

ASCII first character #x00 #x00 #x00 any UTF-32BE

Explicit BOM #xFF #xFE #x00 #x00 UTF-32LE

ASCII first character any #x00 #x00 #x00 UTF-32LE

Explicit BOM #xFE #xFF UTF-16BE

ASCII first character #x00 any UTF-16BE

Explicit BOM #xFF #xFE UTF-16LE

ASCII first character any #x00 UTF-16LE

Explicit BOM #xEF #xBB #xBF UTF-8

Default UTF-8

	Byte 0	Byte 1	Byte 2	Byte 3	Encoding
Explicit BOM	#x00	#x00	#xFE	#xFF	UTF-32BE
ASCII first character	#x00	#x00	#x00	any	UTF-32BE
Explicit BOM	#xFF	#xFE	#x00	#x00	UTF-32LE
ASCII first character	any	#x00	#x00	#x00	UTF-32LE
Explicit BOM	#xFE	#xFF			UTF-16BE
ASCII first character	#x00	any			UTF-16BE
Explicit BOM	#xFF	#xFE			UTF-16LE
ASCII first character	any	#x00			UTF-16LE
Explicit BOM	#xEF	#xBB	#xBF		UTF-8
Default					UTF-8

It looks like the correct behavior is to parse as UTF-16, not to strip and assume UTF-8 (although maybe yaml-rust could make this an option).

mkmik commented 4 years ago

Indeed, but I thought to file a separate issue for that (as it would likely mean a new API). If I understood correctly, the current API takes a rust string, which is already decoded before reaching yaml-rust (IIRC rust strings are utf-8 in memory). Thus, handling the BOM codepoint (which is a perfectly valid unicode code point called Zero width non break space) in the input string would still make sense when dealing with the string based API.

mkmik commented 4 years ago

Forked the thread about encoding detection in #155.