Closed dwolter closed 1 year ago
This occurs even if we specify the correct external-format (:utf-8).
cl-user> ! od -t x1 ~/src-utf-8.lisp
0000000 ef bb bf 28 64 65 66 76 61 72 20 2a 66 6f 6f 2a
0000020 20 27 68 69 29 0a 0a
0000027
; No value
cl-user> (load #P"~/src-utf-8.lisp")
; Evaluation aborted on #<unbound-variable #x30200706355D>.
cl-user> (load #P"~/src-utf-8.lisp" :external-format :utf-8)
; Evaluation aborted on #<unbound-variable #x30200708060D>.
cl-user>
IMO, this is something that must be handled at the encoding/decoding level, i.e. in the external-format, but understandably this opens a small can of worms (what to do with BOMs in the middle of files? What about concatenations? etc.). That would suggest a feature request/improvement.
You can deal with it as suggested in the manual, by having:
(defun ignore-bom (stream ch) (declare (ignore stream ch)) nil)
(set-macro-character #\U+FEFF 'ignore-bom)
(set-macro-character #\U+FFFE 'ignore-bom)
in your rc file.
Thanks for the prompt reply and for pointing me to the workaround from the manual that I missed! It would be nice to see the issue resolved or warnings printed (for naive users like me who thought that with UTF we had overcome text encoding issues).
UTF-8 always has the same byte order, so starting UTF-8 data with a BOM (byte order mark) is not terribly useful.
One major selling point of UTF-8 is that it is ASCII-compatible. Programs expecting ASCII will certainly not know what to do with a BOM.
I think the phrase "It is probably a good idea to skip over this character" is a good example of the manual trying to be humorous in a wry way. :-)
When loading or reading a file saved in UTF-8 that starts with the UTF-8 BOM 0xEF 0xBB 0xBF, the reader first reads a symbol whose name is a single character with char code 65279, i.e., 0xFEFF (the UTF-16 (BE) BOM). When loading, this signals an unbound-variable error. This behaviour is both unexpected (reading a UTF-16 BOM instead of a UTF-8 BOM) and problematic (load error).
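For reference, the character that gets read is the same code point in either encoding: U+FEFF is written as FE FF in UTF-16BE and as EF BB BF in UTF-8. A minimal sketch of decoding those three bytes by hand in portable Common Lisp (the function name decode-utf-8-triple is made up for illustration and is not part of any implementation):

```lisp
;; Hypothetical helper, for illustration only: decodes one three-byte
;; UTF-8 sequence of the form 1110xxxx 10xxxxxx 10xxxxxx to a code point.
;; It does no validation of the input bytes.
(defun decode-utf-8-triple (b1 b2 b3)
  (logior (ash (logand b1 #x0F) 12)   ; low 4 bits of the lead byte
          (ash (logand b2 #x3F) 6)    ; low 6 bits of the first continuation byte
          (logand b3 #x3F)))          ; low 6 bits of the second continuation byte

(decode-utf-8-triple #xEF #xBB #xBF)  ; => 65279, i.e. #xFEFF
```

So a char code of 65279 is what a correct UTF-8 decoder would be expected to produce for the first three bytes in the od dump.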
Possible fix: The manual (chapter 4.5) already states that "A byte order mark from a UTF-8 encoded input stream is not treated specially and just appears as a normal character from the input stream. It is probably a good idea to skip over this character."
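One way the skipping that the manual mentions could look for a single file, without touching the readtable, is sketched below. The name load-skipping-bom is hypothetical (it is not a CCL function), and the sketch assumes the file really is UTF-8 and that the implementation accepts :utf-8 as an external format, as in the transcript above:

```lisp
;; Hypothetical helper: LOAD-SKIPPING-BOM is not provided by CCL.
;; Assumes PATH names a UTF-8 file and that :UTF-8 is a valid
;; external format in the implementation at hand.
(defun load-skipping-bom (path)
  (with-open-file (in path :direction :input :external-format :utf-8)
    ;; Look at the first character without consuming it; NIL on an empty file.
    (when (eql (peek-char nil in nil nil) (code-char #xFEFF))
      (read-char in))                 ; consume the leading BOM
    ;; LOAD accepts an open stream as well as a pathname.
    (load in)))
```

It would be called as (load-skipping-bom "~/src-utf-8.lisp"). This only handles a BOM at the very start of the file; BOMs in the middle of a file or in concatenated files are left alone.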