Meta.parse(repr(Char(0x110000))) fails

JuliaLang / julia

Meta.parse(repr(Char(0x110000))) fails #54396

Open stevengj opened 5 months ago

stevengj commented 5 months ago

Meta.parse(repr(Char(0x110000))) fails because

julia> show(Char(0x110000))
'\U110000'

but '\U110000' is not parseable:

julia> '\U110000'
ERROR: ParseError:
# Error @ REPL[17]:1:2
'\U110000'
#└──────┘ ── invalid unicode escape sequence

isvalid(Char(0x110000)) is false, but other invalid characters are parsed okay:

julia> '\ud800'
'\ud800': Unicode U+D800 (category Cs: Other, surrogate)

julia> isvalid('\ud800')
false

so this seems kind of inconsistent.

Options are either (a) change the printing of Char(0x110000) or (b) change the parsing to allow this. I lean towards (a). Thoughts?
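
To check which characters are affected, a small round-trip test can help. This is only a sketch: `roundtrips` is a hypothetical helper, not something in Base, and the result for Char(0x110000) depends on the Julia version and on whether this issue has been fixed there.

```julia
# Hypothetical helper (not part of Base): does repr(c) parse back to the same Char?
function roundtrips(c::Char)
    try
        return Meta.parse(repr(c)) == c
    catch
        # Any parse failure counts as "does not round-trip".
        return false
    end
end

# A valid ASCII char, a surrogate, the largest valid code point,
# and the first code point past the Unicode range.
for c in ('a', Char(0x0000d800), Char(0x0010ffff), Char(0x00110000))
    println(repr(c), " => ", roundtrips(c))
end
```

On a version that shows the behavior described above, only the last entry would be expected to report false.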

Seelengrab commented 5 months ago

I think this is a bug in the parser. What would the printing be changed to in order to make it parse? Just using \u doesn't work, because \u accepts at most four hex digits and the remaining digits are then read as extra characters:

julia> '\u11000'
ERROR: ParseError:
# Error @ REPL[27]:1:2
'\u11000'
#└─────┘ ── character literal contains multiple characters
Stacktrace:
 [1] top-level scope
   @ REPL:1

stevengj commented 5 months ago

The printing could be changed to '\xf4\x90\x80\x80', by calling Base.show_invalid, for example. ('\U110000' is a lot more understandable, but is meaningless from the perspective of Unicode.)
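
For reference, the bytes in that form are the UTF-8-style encoding that Char stores internally for the out-of-range value. A minimal sketch using only public functions (the variable names are illustrative):

```julia
c = Char(0x00110000)

# The bytes written when the character is converted to a string.
bytes = collect(codeunits(string(c)))
println(bytes)    # UInt8[0xf4, 0x90, 0x80, 0x80]

# A Char holds those bytes left-aligned in a 32-bit value,
# so the same character can be rebuilt from the raw bits.
raw = reinterpret(UInt32, c)
same = reinterpret(Char, 0xf4908080) == c
println(string(raw, base = 16), " ", same)
```

Base.show_invalid is an internal function, so its exact output is best confirmed on the Julia version in question rather than assumed from this sketch.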

It could also print as Char(0x110000), but that's a pretty radical change from how other characters are printed.

If we extend the parser to allow this, I guess we would parse up to '\U1fffff', since Char(0x200000) throws an error. That seems reasonable to me, since there is still a clear upper bound on what we should parse.
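
The bound can be checked directly against the Char constructor. A small sketch, where `try_char` is a hypothetical helper introduced only for this illustration:

```julia
# Hypothetical helper: return the Char for code point u, or nothing if construction throws.
function try_char(u::Integer)
    try
        return Char(UInt32(u))
    catch
        return nothing
    end
end

println(isvalid(Char(0x0010ffff)))            # largest valid Unicode code point
println(isvalid(Char(0x00110000)))            # constructible, but not valid Unicode
println(try_char(0x001fffff) === nothing)     # false: still constructible
println(try_char(0x00200000) === nothing)     # true: the constructor throws
```

So the values a more permissive parser would need to accept beyond '\U10ffff' are exactly 0x110000 through 0x1fffff.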

Seelengrab commented 5 months ago

The manual has that exact value as an example, and documents that \U may be followed by up to 8 hexadecimal digits, so I'd be in favor of fixing the parser.