hexane360 opened 4 years ago
The YAML spec, section 9.1.1 states:
A document may be preceded by a prefix specifying the character encoding, and optional comment lines. Note that all documents in a stream must use the same character encoding. However it is valid to re-specify the encoding using a byte order mark for each document in the stream. This makes it easier to concatenate streams.
The existence of the optional prefix does not necessarily indicate the existence of an actual document.
I think yaml-rust should eat up and ignore the \u{feff} codepoint at the beginning of each document (and not in other places).
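A minimal sketch of that suggestion for a string-based API. `strip_leading_bom` is a hypothetical helper, not part of yaml-rust; it assumes the caller already has the text of a single document and only wants to drop one BOM at its very start:

```rust
/// Hypothetical helper (not a yaml-rust API): remove a single leading
/// U+FEFF from `input`, leaving any later occurrence untouched.
fn strip_leading_bom(input: &str) -> &str {
    // `strip_prefix` returns `None` when the prefix is absent,
    // in which case the input is returned unchanged.
    input.strip_prefix('\u{feff}').unwrap_or(input)
}
```

Because only the prefix is inspected, a U+FEFF in the middle of a scalar is preserved, which matches the "not in other places" part of the proposal.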
I noticed this issue while evaluating yaml-rust; see demo https://github.com/mkmik/rustyamltest which contains code that reproduces the BOM issue clearly.
Note also section 5.2.
On input, a YAML processor must support the UTF-8 and UTF-16 character encodings.
| | Byte 0 | Byte 1 | Byte 2 | Byte 3 | Encoding |
|---|---|---|---|---|---|
| Explicit BOM | #x00 | #x00 | #xFE | #xFF | UTF-32BE |
| ASCII first character | #x00 | #x00 | #x00 | any | UTF-32BE |
| Explicit BOM | #xFF | #xFE | #x00 | #x00 | UTF-32LE |
| ASCII first character | any | #x00 | #x00 | #x00 | UTF-32LE |
| Explicit BOM | #xFE | #xFF | | | UTF-16BE |
| ASCII first character | #x00 | any | | | UTF-16BE |
| Explicit BOM | #xFF | #xFE | | | UTF-16LE |
| ASCII first character | any | #x00 | | | UTF-16LE |
| Explicit BOM | #xEF | #xBB | #xBF | | UTF-8 |
| Default | | | | | UTF-8 |
It looks like the correct behavior is to parse as UTF-16, not to strip and assume UTF-8 (although maybe yaml-rust could make this an option).
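For illustration, the detection rules quoted from section 5.2 could be sketched over raw bytes roughly as follows. The `Encoding` enum and `detect_encoding` function are assumptions made up for this example; yaml-rust does not expose anything like them today. The match arms are tried in the same order as the rows of the spec's table.

```rust
/// Hypothetical type for this sketch; not part of yaml-rust.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Encoding {
    Utf32Be,
    Utf32Le,
    Utf16Be,
    Utf16Le,
    Utf8,
}

/// Guess the encoding from the first bytes of a stream.
/// Returns the encoding and the length of the explicit BOM in bytes
/// (0 when the guess came from an ASCII first character or the default).
fn detect_encoding(bytes: &[u8]) -> (Encoding, usize) {
    match bytes {
        [0x00, 0x00, 0xFE, 0xFF, ..] => (Encoding::Utf32Be, 4),
        [0x00, 0x00, 0x00, _, ..] => (Encoding::Utf32Be, 0),
        [0xFF, 0xFE, 0x00, 0x00, ..] => (Encoding::Utf32Le, 4),
        [_, 0x00, 0x00, 0x00, ..] => (Encoding::Utf32Le, 0),
        [0xFE, 0xFF, ..] => (Encoding::Utf16Be, 2),
        [0x00, _, ..] => (Encoding::Utf16Be, 0),
        [0xFF, 0xFE, ..] => (Encoding::Utf16Le, 2),
        [_, 0x00, ..] => (Encoding::Utf16Le, 0),
        [0xEF, 0xBB, 0xBF, ..] => (Encoding::Utf8, 3),
        // No BOM and no recognisable pattern: the spec's default.
        _ => (Encoding::Utf8, 0),
    }
}
```

A byte-based entry point would still need to decode the remaining bytes after detection; this sketch only covers the table lookup.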
Indeed, but I thought it better to file a separate issue for that, since it would likely mean a new API. If I understand correctly, the current API takes a Rust string, which is already decoded before it reaches yaml-rust (IIRC Rust strings are UTF-8 in memory). Thus, handling the BOM code point (a perfectly valid Unicode code point, U+FEFF ZERO WIDTH NO-BREAK SPACE) in the input string would still make sense for the string-based API.
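To illustrate why the string-based API still sees the BOM: Rust's standard decoders keep U+FEFF as an ordinary code point, so it survives into the decoded string whichever encoding the bytes came from. The wrapper functions below are hypothetical names used only for this sketch; the standard-library calls inside them (`std::str::from_utf8`, `String::from_utf16`, `str::starts_with`) are real.

```rust
/// Hypothetical wrapper: decode UTF-8 bytes as a caller might before
/// handing the text to a string-based loader.
fn decode_utf8(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    // Validates the bytes; does not strip a leading BOM.
    std::str::from_utf8(bytes)
}

/// Hypothetical wrapper: decode native-endian UTF-16 code units.
fn decode_utf16(units: &[u16]) -> Result<String, std::string::FromUtf16Error> {
    // Also leaves a leading U+FEFF in place.
    String::from_utf16(units)
}

/// True when the decoded text begins with the BOM code point.
fn starts_with_bom(text: &str) -> bool {
    text.starts_with('\u{feff}')
}
```

In both cases the decoded text begins with `\u{feff}`, which is what a scanner working on `&str` input would encounter.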
Forked the thread about encoding detection in #155.
In certain situations, yaml-rust fails to handle strings starting with a UTF-8 encoded byte order mark.
Plenty of software (such as notepad.exe) inserts BOMs automatically, so I think it's something yaml-rust should be aware of.
Currently, yaml-rust scans like this:
All but the last one parse successfully.
I'm not sure exactly what the correct behavior is here, but I don't think we can push the burden of sanitization onto users. Thoughts?