Open jswrenn opened 3 years ago
Can you provide a standalone test file, and see if it fails using `src/scripts/just-lex.js` or `src/scripts/just-parse.js`?
Another student just triggered the bug with ਖ਼.
I'm currently on a bike ride, but I can do this once I get back home.
Broken:

```
$ node src/scripts/just-lex.js <(printf '#|𢡊|#')
/dev/fd/63
TypeError: Cannot read property 'name' of undefined
    at Tokenizer.next (/home/john/projects/pyret-lang/lib/jglr/jglr.js:319:34)
    at /home/john/projects/pyret-lang/src/scripts/just-lex.js:24:17
    at Object.execCb (/home/john/projects/pyret-lang/node_modules/requirejs/bin/r.js:1941:33)
    at Module.check (/home/john/projects/pyret-lang/node_modules/requirejs/bin/r.js:1116:51)
    at Module.enable (/home/john/projects/pyret-lang/node_modules/requirejs/bin/r.js:1428:22)
    at Module.init (/home/john/projects/pyret-lang/node_modules/requirejs/bin/r.js:1034:26)
    at Timeout._onTimeout (/home/john/projects/pyret-lang/node_modules/requirejs/bin/r.js:1704:36)
    at ontimeout (timers.js:436:11)
    at tryOnTimeout (timers.js:300:5)
    at listOnTimeout (timers.js:263:5)
```
...working(?):

```
$ node src/scripts/just-lex.js <(printf '#|𢡊|#\n')
/dev/fd/63
Atom {
  name: 'EOF',
  asString: '$',
  key: '$',
  pos:
   SrcLoc {
     startRow: 2,
     startCol: 0,
     startChar: 7,
     endRow: 2,
     endCol: 0,
     endChar: 7 } }
=========================================================
{ 'RATIO ': '11.5402569773' }
{ 'num NAMES ': undefined }
{ 'total toks': 1 }
WS : sumNew: 0.0003318410 numNew: 1 avgNew: 0.0003318410 sumOld: 0.0044162990 numOld: 1 avgOld: 0.0044162990 ratio (old/new): 13.3084790608
```
Note the ending newline in the latter case.
The lack of a trailing newline is a red herring here; it has almost nothing to do with what's going on.
The actual problem doesn't surprise me much. The two characters you've pasted are multibyte characters, and JS has really lousy support for working with them robustly. Specifically, as it's counting srcloc positions, it advances over each UTF-16 code unit rather than each code point. You can force the code not to crash by changing line 219 of jglr.js to say `this.length = this.str.length` instead of `this.length = str.length`, which will count lengths using the NFC-normalized version of the input string. But this still screws up the counting of srcloc info: for instance, for your example file, we get that the EOF token is at
```
Atom {
  name: 'EOF',
  asString: '$',
  key: '$',
  pos: SrcLoc {
    startRow: 1,
    startCol: 5,
    startChar: 6,
    endRow: 1,
    endCol: 5,
    endChar: 6
  }
}
```
and I do not know whether that mismatch between startCol and startChar will wind up mattering somewhere down the line.
What I'm worried about, though, is that even if we "fix" this line of code, you can produce Pyret files that snarl up the lexer, parser, highlighter, error locations, etc., just by having a bunch of multibyte characters early in the file and then an error later in the file.
I'm not sure the one-line "fix" I mention is the right one, and it'll take someone much more expert than I am in JS and browser (and CodeMirror!) Unicode handling to figure out whether we have a robust enough fix. (And I definitely do not have the cycles right now to teach myself the necessary background.)
I don't think it's surprising that this causes an error outside of strings, right? Pyret source is only intended to be ASCII, except for inside strings.
It's a bug if it doesn't work in string literals, though, because we proudly have lots of examples of multibyte characters in strings (https://www.pyret.org/docs/latest/strings.html), and they work.
FWIW, the character in the original message doesn't even render on my machine on GitHub, so I can't reproduce by copy/pasting.
But the second example parses fine in a string:
So what's special about that first character? Do we know how it was generated? Can someone tell me what it is and what its code point(s) are?
I think it parses, but I'm still tentatively concerned that if a later expression (after that multibyte string) has an error, our srcloc info will be wrong by a character or so. I've always worried about this, and just quietly hoped it wouldn't ever come up ;-) Recognizing characters in strings and recognizing characters in comments are two slightly different codepaths in the lexer: strings have escape sequences in them, so the lexer needs to be more particular there.
> Can someone tell me what it is and what its code point(s) are?
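One way to answer that without relying on rendering at all: a small inspection helper (hypothetical, just for this thread) that reports each code point in a string and how many UTF-16 code units it occupies. Paste the mystery character into the call:

```javascript
// Return [{ cp, hex, units }] for each code point in a string.
function codePoints(s) {
  return [...s].map((c) => ({
    cp: c.codePointAt(0),
    hex: 'U+' + c.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'),
    units: c.length, // 2 for astral-plane characters (surrogate pairs)
  }));
}

console.log(codePoints('𢡊'));
// Any entry with units: 2 is the kind of input that trips code-unit
// counting; a multi-element result would mean the pasted "character"
// is actually a sequence of code points.
```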
A CS19 student inadvertently fuzz tested the Pyret compiler with a randomly generated string containing the character 𢡊. The appearance of this character, even in a comment, causes CPO to display an internal error.
Try it here: https://code.pyret.org/editor#share=10Tl4poz5CL9FoXQ4b6Ku5Gn0ol0xoczO&v=8c4da7d