Reader: symbols beginning with numbers and underscore cause part of the input to be discarded

egallesio / STklos

STklos Scheme

http://stklos.net

GNU General Public License v2.0

68 stars 17 forks source link

Reader: symbols beginning with numbers and underscore cause part of the input to be discarded #579

Closed jpellegrini closed 10 months ago

jpellegrini commented 11 months ago

Hi @egallesio ! A minor issue, I think:

If a token starts with a number, and it does not parse as a number. then STklos reads it as a symbol:

1abc => symbol |1abc|

However, if it has underscores, then -- depending on the character that follows the udnerscore --the part after the underscore is discarded -- it's neither part of the symbol, nor read as a separate token later:

1_abcdef => symbol |1_|

and the abcdef is lost...

stklos> (define 1_abc -10)
;; 1_
stklos> 1_o
-10
stklos> 1_p
-10
stklos> 1_i
-10
stklos> 1_d
**** Error:
%execute: symbol `1_d' unbound in module `stklos'
stklos> 1_def
**** Error:
%execute: symbol `1_def' unbound in module `stklos'

(I don't know why it kept d and def but not the others)

Anyway... I guess this was introduced with the SRFI that allows underscores in numbers (so the mistake is mine, since I have implemented those changes to the reader -- sorry!) I'll take a look later.

jpellegrini commented 11 months ago

Some more information:

1_2_a => symbol |12_|

jpellegrini commented 11 months ago

I can't work on this right now, but here's an idea (I'll try to implement later):

read_integer_or_real makes a copy of the string and works on it;
if it was a number, measure its size and update **end
if it was not a number, return #f without updating anything.

jpellegrini commented 11 months ago

read_integer_or_real makes a copy of the string and works on it;

It has no idea where the end of the token is, so not viable. Another idea is for remove_underscores to make a copy, allocating a buffer and growing it as necessary. This reduces the overhead to only tokens with underscores.

lassik commented 11 months ago

Generally the best idea for reading Lisp is:

Treat |...| vertical bar symbols like strings.
For other symbols, as well as numbers, read a "token" of symbol-or-number characters into a string and then parse that string.

In Scheme pseudo-code, something like this:

(case first-char
  ...
  ((#\")
   (read-delimited-string #\"))
  ((#\|)
   (string->symbol (read-delimited-string #\|)))
  (else
   (let ((token (read-token first-char)))
     (if (string=? token "")
         (error "Invalid syntax")
         (or (parse-number token)
             (string->symbol token))))))

lassik commented 11 months ago

NB: RnRS uses a BNF-style grammar, but that's IMHO a really confusing mental model for Lisp. Common Lisp does not use BNF; it presents a reader algorithm tailor made for Lisp.

egallesio commented 11 months ago

I think I have a fix for that issue. I'll clean it and test it more later.

egallesio commented 10 months ago

Generally the best idea for reading Lisp is:

* Treat `|...|` vertical bar symbols like strings.

* For other symbols, as well as numbers, read a "token" of symbol-or-number characters into a string and then parse that string.

In Scheme pseudo-code, something like this:

(case first-char
  ...
  ((#\")
   (read-delimited-string #\"))
  ((#\|)
   (string->symbol (read-delimited-string #\|)))
  (else
   (let ((token (read-token first-char)))
     (if (string=? token "")
         (error "Invalid syntax")
         (or (parse-number token)
             (string->symbol token))))))

This is what we already do in STklos. The problem was that parse-number may modify token to make it readable by standard C functions (for instance, if the number uses d exponent notation, we replace it by and e before to send the string to standard strod). One solution is of course to make a copy of token in parse-token but doing it for each token costs a lot.

Current implementation avoids allocation as much as possible and should fix the issue.