Cool tests, I must say.
Somewhere, somebody is losing track of the UTF-8-ness of the input stream. Given an input file containing four multibyte characters:
% od -cx input.txt
0000000 非 ** ** 常 ** ** に ** ** 重 ** ** \n
9de9 e59e b8b8 81e3 e9ab 8d87 000a
0000015
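Each of the four characters encodes to three octets in UTF-8, so the payload is 12 bytes (13 with the trailing newline) but only 4 characters. A quick sanity check, assuming SBCL for sb-ext:string-to-octets:

(length "非常に重")
=> 4
(length (sb-ext:string-to-octets "非常に重" :external-format :utf-8))
=> 12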
The built-in reader (using :external-format; the :utf-8 value is an SBCL-ism):
(with-open-file (stream "input.txt" :external-format :utf-8)
  (symbol-name (read stream)))
=> "非常に重"
The Coalton reader gets the offsets wrong:
(with-open-file (stream "input.txt" :external-format :utf-8)
  (cst:source (maybe-read-form stream)))
=> (0 . 12)
Note that 12 is exactly the encoded length of the symbol in octets (four characters, three octets each): the end offset is being counted in bytes, not characters.
But string input streams are fine:
(with-input-from-string (stream "非常に重")
  (cst:source (maybe-read-form stream)))
=> (0 . 4)
Reading the file into a string one character at a time, then reading from a string stream, also yields character offsets:

(defun read-c-c (file)
  "Read a file one character at a time."
  (with-open-file (stream file :external-format :utf-8)
    (coerce (loop :for c := (read-char stream nil nil)
                  :while c
                  :collect c)
            'string)))

(with-input-from-string (stream (read-c-c "input.txt"))
  (cst:source (maybe-read-form stream)))
=> (0 . 4)
Remarkably, using flexi-streams to decode the UTF-8 produces even worse results:
(with-open-file (stream "input.txt" :element-type '(unsigned-byte 8))
  (let ((in (flexi-streams:make-flexi-stream stream :external-format :utf-8)))
    (cst:source (maybe-read-form in))))
=> (3 . 13)
Both numbers appear to be octet positions taken from the underlying stream.
Is there a reason not to just stick with byte offsets? They should be much faster when seeking in file streams.
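To make the seeking point concrete: a byte offset allows constant-time positioning on an octet stream via standard file-position, whereas a character offset in a variable-width encoding can only be located by decoding from the start of the file. A minimal sketch using the input file from above:

(with-open-file (stream "input.txt" :element-type '(unsigned-byte 8))
  (file-position stream 12) ; O(1) seek straight to byte offset 12
  (read-byte stream))
=> 10                       ; the code point of the trailing #\Newline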
I added character offset tracking to file input streams in #1136 and copied these tests there. No code assumes byte offsets; there simply hadn't been any tests containing multibyte characters.
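One way such tracking can work, shown here as a hypothetical sketch rather than the actual #1136 change, is a Gray-streams wrapper that counts characters as the reader consumes them:

(defclass char-counting-stream
    (trivial-gray-streams:fundamental-character-input-stream)
  ((underlying :initarg :underlying :reader underlying)
   (offset :initform 0 :accessor char-offset))) ; characters read so far

(defmethod trivial-gray-streams:stream-read-char ((s char-counting-stream))
  (let ((c (read-char (underlying s) nil :eof)))
    (unless (eq c :eof)
      (incf (char-offset s))) ; count characters, never octets
    c))

(defmethod trivial-gray-streams:stream-unread-char ((s char-counting-stream) c)
  (decf (char-offset s))
  (unread-char c (underlying s)))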
Here are two tests that expose a disagreement between the reader and the error printer over the value of node-source: the reader seems to generate byte offsets, while the printer expects character offsets.
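To illustrate the disagreement concretely (a hedged sketch, not the actual tests): an error printer that slices source text with subseq needs character offsets, so a byte offset coming out of the reader overruns the string:

(let ((source "非常に重"))
  (subseq source 0 4))  ; character offsets: in bounds
=> "非常に重"

;; With the reader's byte offset, (subseq source 0 12) is an error:
;; the string holds only 4 characters.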