esmolanka / sexp-grammar

Invertible parsing for S-expressions

Symbols should be allowed to start with a digit #14

Closed athas closed 4 years ago

athas commented 4 years ago

In most (all?) Lisps, symbols are allowed to start with a digit. Tokens like 1i2, 1_2, etc. should be considered symbolic atoms. The current lexer explicitly behaves otherwise, but I'm wondering if there's a good reason for this. I don't think there would be any ambiguity in being more flexible.

athas commented 4 years ago

There is arguably a small ambiguity with scientific notation (2e3 is numeric), but I don't think it matters. And any consumers that need to treat 2e3 as symbolic can just accept numerical symbols instead. My issue is that tokens like 1i2 are not accepted as symbols at all.

athas commented 4 years ago

Note that this is technically a change in behaviour, since (0i2) is currently parsed as a list containing the two atoms 0 and i2. I would, however, argue that this is a tokenization bug.
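To make the proposed behaviour concrete, here is a minimal sketch of a tokenizer that first takes the whole run of non-delimiter characters and only then decides whether it is a number or a symbol. The `Token` type and the function names are hypothetical and are not taken from the sexp-grammar lexer; only plain integers count as numbers in this sketch.

```haskell
import Data.Char (isDigit, isSpace)

-- Hypothetical token type for this sketch; not the library's own.
data Token = TNumber String | TSymbol String | TOpen | TClose
  deriving (Eq, Show)

-- Characters that end an atom.
isDelimiter :: Char -> Bool
isDelimiter c = isSpace c || c `elem` "()"

-- Classify a complete atom: only an all-digit run is a number here,
-- everything else (including 0i2 or 1_2) is a symbol.
classify :: String -> Token
classify s
  | not (null s) && all isDigit s = TNumber s
  | otherwise                     = TSymbol s

tokenize :: String -> [Token]
tokenize [] = []
tokenize (c:cs)
  | isSpace c = tokenize cs
  | c == '('  = TOpen  : tokenize cs
  | c == ')'  = TClose : tokenize cs
  | otherwise =
      -- Consume the whole atom before classifying it, so a leading
      -- digit cannot split the token in two.
      let (atom, rest) = break isDelimiter (c:cs)
      in classify atom : tokenize rest
```

With this approach `tokenize "(0i2)"` yields a single symbol token between the parentheses, while `tokenize "(0 i2)"` still yields a number followed by a symbol.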

esmolanka commented 4 years ago

You are right, the lexer treats 0i2 as two separate tokens. It should either have accepted it as a symbol or raised an error (depending on which behaviour we like better).

I also agree that 0i2 should really be treated as a symbol. However, the clash between 2e3 the number and 2e3 the symbol is quite unpleasant. It would be a shame if you could enter any symbol starting with a digit except ones of the form [0-9]+e[0-9]+. Reading such a token as a number and then somehow recovering its symbolic representation does not work: 01e0 reads as 1.0, which is textually quite different.
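The round-trip problem can be illustrated with a small sketch that reads a token as a `Double` using the standard `readMaybe` and prints it back. This is not how the sexp-grammar lexer reads numbers; `roundTrip` and `preservesText` are hypothetical helpers used only to show that the original spelling is lost.

```haskell
import Text.Read (readMaybe)

-- Read a token as a Double, then print it back.
-- Hypothetical helper for illustration only.
roundTrip :: String -> Maybe String
roundTrip tok = show <$> (readMaybe tok :: Maybe Double)

-- True when the printed form equals the original token text.
preservesText :: String -> Bool
preservesText tok = roundTrip tok == Just tok
```

Here `preservesText "2e3"` and `preservesText "01e0"` are both `False`, so the symbol text cannot be recovered from the numeric value alone.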

The only solution I can think of is to ban scientific notation for numbers in the lexer. If anyone wants to read a number in scientific format, it's always possible to parse it from the corresponding symbol. E.g. the real and double grammars could accept both AtomNumber and AtomSymbol of a certain shape. (See implementation of real)
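A rough sketch of that idea, assuming a simplified stand-in for the atom type: the `Atom` type below reuses the constructor names mentioned above but with assumed payload types (`Double` and `String`), and `sciFromSymbol` and `atomToDouble` are hypothetical names, not the library's API.

```haskell
import Data.Char (isDigit)

-- Simplified stand-in for the library's atom type; the payload types
-- are assumptions made for this illustration only.
data Atom = AtomNumber Double | AtomSymbol String
  deriving (Eq, Show)

-- Parse a symbol of the shape [0-9]+e[0-9]+ as a number.
sciFromSymbol :: String -> Maybe Double
sciFromSymbol s =
  case span isDigit s of
    (m@(_:_), 'e' : e@(_:_))
      | all isDigit e ->
          -- Both parts are non-empty digit runs, so read cannot fail.
          Just (fromInteger (read m) * 10 ^^ (read e :: Integer))
    _ -> Nothing

-- A "real"-like reader that accepts either atom form.
atomToDouble :: Atom -> Maybe Double
atomToDouble (AtomNumber n) = Just n
atomToDouble (AtomSymbol s) = sciFromSymbol s
```

Under these assumptions both `atomToDouble (AtomNumber 100)` and `atomToDouble (AtomSymbol "1e2")` produce the same value, while a symbol such as `1i2` is rejected by the numeric reader and stays available as a plain symbol.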

However, there is one more problem with that: 1e2 would be perfectly parseable by both the symbol and real grammars, but e.g. 100 won't be parsed as a symbol anymore. This also looks like an inconsistency. It also adds the burden of dispatching between AtomNumber and AtomSymbol, plus parsing of the scientific format, to clients who use Language.Sexp directly, without the invertible grammar stuff. OTOH, this is probably minor compared to not being able to have symbols starting with a digit at all.

esmolanka commented 4 years ago

Please see ad73a88

athas commented 4 years ago

Looks good to me!