coalton-lang / coalton

Coalton is an efficient, statically typed functional programming language that supercharges Common Lisp.
https://coalton-lang.github.io/

Test behavior of method names containing multibyte UTF-8 characters #1137

Closed · jbouwman closed this issue 2 months ago

jbouwman commented 2 months ago

Here are two tests that expose a disagreement between the reader and the error printer vis-à-vis the value of node-source: the reader seems to generate byte offsets, while the printer expects character offsets.
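For scale, a quick illustration (sb-ext:string-to-octets is SBCL-specific): the four-character name used below encodes to twelve UTF-8 octets, so the two offset schemes diverge as soon as a multibyte character appears.

(length "非常に重")                                                    ; => 4 characters

(length (sb-ext:string-to-octets "非常に重" :external-format :utf-8)) ; => 12 octets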

stylewarning commented 2 months ago

cool tests i must say

jbouwman commented 2 months ago

Somewhere, somebody is losing track of the UTF-8'ness of the input stream. Given an input file containing four multibyte characters:

% od -cx input.txt
0000000   非  **  **  常  **  **  に  **  **  重  **  **  \n
             9de9    e59e    b8b8    81e3    e9ab    8d87    000a
0000015
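The fixture can be recreated with something like the following sketch (write-line supplies the trailing newline seen in the dump):

(with-open-file (stream "input.txt"
                        :direction :output
                        :external-format :utf-8
                        :if-exists :supersede)
  (write-line "非常に重" stream))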

Built-in reader (using :external-format :utf-8, an SBCLism):

(with-open-file (stream "input.txt"
                        :external-format :utf-8)
  (symbol-name (read stream)))

=> "非常に重"

The Coalton reader gets the offsets wrong:

(with-open-file (stream "input.txt"
                        :external-format :utf-8)
  (cst:source (maybe-read-form stream)))

=> (0 . 12)
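One plausible mechanism, though not confirmed here: if the positions are derived from file-position, SBCL reports octet positions on a file stream opened with a multibyte external format, e.g.:

(with-open-file (stream "input.txt" :external-format :utf-8)
  (loop :repeat 4
        :do (read-char stream)
        :collect (file-position stream)))

=> (3 6 9 12)

Those are octet positions; a character count would give (1 2 3 4).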

But string input streams are fine:

(with-input-from-string (stream "非常に重")
  (cst:source (maybe-read-form stream)))

=> (0 . 4)

The same holds when the file's contents are read into a string first:

(defun read-c-c (file)
  "Read FILE one character at a time, returning its contents as a string."
  (with-open-file (stream file)
    (coerce (loop :for s := (read-char stream nil nil)
                  :while s
                  :collect s)
            'string)))

(with-input-from-string (stream (read-c-c "input.txt"))
  (cst:source (maybe-read-form stream)))

=> (0 . 4)
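Which is consistent with file-position on a string input stream being a plain index into the string, i.e. a character offset:

(with-input-from-string (stream "非常に重")
  (loop :repeat 4
        :do (read-char stream)
        :collect (file-position stream)))

=> (1 2 3 4)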

jbouwman commented 2 months ago

Remarkably, using flexi-streams to decode the UTF-8 input produces worse results:

(with-open-file (stream "input.txt" :element-type 'unsigned-byte)
  (let ((in (flexi-streams:make-flexi-stream stream :external-format :utf-8)))
    (cst:source (maybe-read-form in))))

=> (3 . 13)

eliaslfox commented 2 months ago

Is there a reason to not just stick with byte offsets? They should be much faster when seeking in file streams.
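For what it's worth, the trade-off in a sketch (not code from the repo): a byte offset can be handed straight to file-position on a binary stream, while locating character N under a variable-width encoding means decoding the N characters before it.

(with-open-file (stream "input.txt" :element-type '(unsigned-byte 8))
  (file-position stream 9)                ; constant-time jump to octet 9
  (read-byte stream))                     ; => 233, the first octet of 重

(with-open-file (stream "input.txt" :external-format :utf-8)
  (loop :repeat 3 :do (read-char stream)) ; must decode characters 0-2 first
  (read-char stream))                     ; => #\重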

jbouwman commented 2 months ago

Added character offset tracking to file input streams in #1136, and copied these tests there. No code assumes byte offsets; the discrepancy went unnoticed because there hadn't been any tests containing multibyte characters.
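For context, one way to do that kind of tracking, sketched with trivial-gray-streams (just an illustration of the idea, not the code in #1136): wrap the character stream and count characters as they are read.

;; A character input stream that counts characters as they are read,
;; independent of how many octets each character occupies on disk.
(defclass char-position-stream
    (trivial-gray-streams:fundamental-character-input-stream)
  ((inner :initarg :inner :reader inner-stream)
   (char-position :initform 0 :accessor character-position)))

(defmethod trivial-gray-streams:stream-read-char ((stream char-position-stream))
  (let ((char (read-char (inner-stream stream) nil :eof)))
    (unless (eq char :eof)
      (incf (character-position stream)))
    char))

(defmethod trivial-gray-streams:stream-unread-char ((stream char-position-stream) char)
  (decf (character-position stream))
  (unread-char char (inner-stream stream)))

(with-open-file (inner "input.txt" :external-format :utf-8)
  (let ((stream (make-instance 'char-position-stream :inner inner)))
    (loop :repeat 4 :do (read-char stream))
    (character-position stream)))

=> 4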