UTF-8 /BOM issues: load fails when UTF-8 source code file starts with byte order mark (BOM)

dwolter commented 2 years ago

When trying to load a file or read from a file saved in UTF-8 starting with UTF-8 BOM 0xEF 0xBB 0xBF, first a symbol is read whose name is a single char with char code 65279, i.e., 0xfeff (UTF-16 (BE) BOM). In case of loading, an undefined variable error is signalled. This behaviour is both unexpected (reading UTF-16 BOM instead of UTF-8 BOM) and problematic (load error).
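For reference, the char code 65279 can be checked by hand: the bytes 0xEF 0xBB 0xBF are the three-byte UTF-8 encoding of code point U+FEFF, so the decoder is handing back the BOM character itself. The following is only a minimal sketch of the standard three-byte UTF-8 decoding arithmetic; `decode-utf-8-3-byte` is a hypothetical helper name used for illustration, not a CCL function.

```lisp
;; Hypothetical helper: combine the payload bits of a three-byte
;; UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx) into one code point.
(defun decode-utf-8-3-byte (b1 b2 b3)
  (logior (ash (logand b1 #x0F) 12)   ; low 4 bits of the lead byte
          (ash (logand b2 #x3F) 6)    ; low 6 bits of the 1st continuation byte
          (logand b3 #x3F)))          ; low 6 bits of the 2nd continuation byte

;; The UTF-8 BOM bytes from the od dump in this thread:
(decode-utf-8-3-byte #xEF #xBB #xBF)  ; => 65279, i.e. #xFEFF
```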

Possible fix: The manual (chapter 4.5) already notes that "A byte order mark from a UTF-8 encoded input stream is not treated specially and just appears as a normal character from the input stream. It is probably a good idea to skip over this character."

informatimago commented 2 years ago

This occurs even if we specify the correct external-format (:utf-8).

cl-user> ! od -t x1 ~/src-utf-8.lisp
0000000    ef  bb  bf  28  64  65  66  76  61  72  20  2a  66  6f  6f  2a
0000020    20  27  68  69  29  0a  0a                                    
0000027
; No value
cl-user> (load #P"~/src-utf-8.lisp")
; Evaluation aborted on #<unbound-variable #x30200706355D>.
cl-user> (load #P"~/src-utf-8.lisp" :external-format :utf-8)
; Evaluation aborted on #<unbound-variable #x30200708060D>.
cl-user>

IMO, this must be handled at the level of the encoding/decoding, i.e., in the external-format. Understandably, this opens a small can of worms (what to do with BOMs in the middle of files? what about concatenations? etc.), which would suggest a feature request/improvement.

You can deal with it as suggested in the manual, by having:

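;; Returning zero values tells the reader to skip the character;
;; returning nil would make the reader produce the object NIL instead.
(defun ignore-bom (stream ch) (declare (ignore stream ch)) (values))
(set-macro-character #\U+FEFF 'ignore-bom)
(set-macro-character #\U+FFFE 'ignore-bom)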

in your rc file.
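If changing the readtable globally is not desirable, another option is to skip the BOM explicitly before reading. The following is a rough sketch, not something taken from the manual: `skip-bom` and `load-skipping-bom` are hypothetical names, and it assumes the implementation decodes the file as UTF-8 via `:external-format :utf-8` (as used with CCL earlier in this thread) and accepts a stream as the argument to `load`, which the standard permits.

```lisp
;; Hypothetical helper: consume a leading U+FEFF, if present, and
;; return the stream so the caller can keep reading from it.
(defun skip-bom (stream)
  (let ((ch (peek-char nil stream nil nil)))   ; nil at end of file
    (when (and ch (= (char-code ch) #xFEFF))
      (read-char stream))
    stream))

;; Hypothetical wrapper: open the file as UTF-8, drop the BOM,
;; then let LOAD read the remaining forms from the stream.
(defun load-skipping-bom (path)
  (with-open-file (in path :external-format :utf-8)
    (skip-bom in)
    (load in)))
```

A quick check with a string stream, without touching any file:

```lisp
(with-input-from-string (in (format nil "~C(+ 1 2)" (code-char #xFEFF)))
  (skip-bom in)
  (read in))   ; reads the list (+ 1 2)
```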

dwolter commented 2 years ago

Thanks for the prompt reply and for pointing me to the workaround from the manual that I missed! It would be nice to see the issue resolved or warnings printed (for naive users like me who thought that with UTF we had overcome text encoding issues).

xrme commented 1 year ago

UTF-8 always has the same byte order, so starting UTF-8 data with a BOM (byte order mark) is not terribly useful.

One major selling point of UTF-8 is that it is ASCII-compatible. Programs expecting ASCII will certainly not know what to do with a BOM.

I think the phrase "It is probably a good idea to skip over this character" is a good example of the manual trying to be humorous in a wry way. :-)

Clozure / ccl

UTF-8 /BOM issues: load fails when UTF-8 source code file starts with byte order mark (BOM) #412