Pyret is not fully Unicode-aware

jswrenn commented 3 years ago

A CS19 student inadvertently fuzz tested the Pyret compiler with a randomly generated string containing the character 𢡊. The appearance of this character, even in a comment, causes CPO to display an internal error.

Try it here: https://code.pyret.org/editor#share=10Tl4poz5CL9FoXQ4b6Ku5Gn0ol0xoczO&v=8c4da7d

blerner commented 3 years ago

Can you provide a standalone test file, and see if it fails using src/scripts/just-lex.js or src/scripts/just-parse.js

jswrenn commented 3 years ago

Another student just triggered the bug with ਖ਼.

I'm currently on a bike ride, but I can do this once I get back home.

jswrenn commented 3 years ago

Broken:

$ node src/scripts/just-lex.js <(printf '#|𢡊|#')
/dev/fd/63
TypeError: Cannot read property 'name' of undefined
    at Tokenizer.next (/home/john/projects/pyret-lang/lib/jglr/jglr.js:319:34)
    at /home/john/projects/pyret-lang/src/scripts/just-lex.js:24:17
    at Object.execCb (/home/john/projects/pyret-lang/node_modules/requirejs/bin/r.js:1941:33)
    at Module.check (/home/john/projects/pyret-lang/node_modules/requirejs/bin/r.js:1116:51)
    at Module.enable (/home/john/projects/pyret-lang/node_modules/requirejs/bin/r.js:1428:22)
    at Module.init (/home/john/projects/pyret-lang/node_modules/requirejs/bin/r.js:1034:26)
    at Timeout._onTimeout (/home/john/projects/pyret-lang/node_modules/requirejs/bin/r.js:1704:36)
    at ontimeout (timers.js:436:11)
    at tryOnTimeout (timers.js:300:5)
    at listOnTimeout (timers.js:263:5)

...working(?):

$ node src/scripts/just-lex.js <(printf '#|𢡊|#\n')
/dev/fd/63
Atom {
  name: 'EOF',
  asString: '$',
  key: '$',
  pos:
   SrcLoc {
     startRow: 2,
     startCol: 0,
     startChar: 7,
     endRow: 2,
     endCol: 0,
     endChar: 7 } }
=========================================================
{ 'RATIO     ': '11.5402569773' }
{ 'num NAMES ': undefined }
{ 'total toks': 1 }
WS : sumNew: 0.0003318410 numNew: 1     avgNew: 0.0003318410       sumOld: 0.0044162990 numOld: 1     avgOld: 0.0044162990       ratio (old/new): 13.3084790608

Note the ending newline in the latter case.

blerner commented 3 years ago

The lack of a trailing newline is a red herring here; it has almost nothing to do with what's going on.

The actual problem doesn't surprise me much. The two characters you've pasted are multibyte characters, and JS has really lousy support for working with them robustly. Specifically, as its counting srcloc positions, it advances over each character., rather than each codePoint. You can force the code to not crash by changing line 219 of jglr.js to say this.length = this.str.length instead of this.length = str.length, which will count lengths using the NFC-normalized version of the input string. But, this still screws up the counting of srcloc info: for instance, for your example file, we get that the EOF token is at

Atom {
  name: 'EOF',
  asString: '$',
  key: '$',
  pos: SrcLoc {
    startRow: 1,
    startCol: 5,
    startChar: 6,
    endRow: 1,
    endCol: 5,
    endChar: 6
  }
}

and I do not know whether that mismatch between startCol and startChar will wind up mattering, somewhere down the line.

What I'm worried about, though, is that even if we "fix" this line of code, you can produce Pyret files that snarl up the lexer, parser, highlighter, error locations, etc, just by having a bunch of multibyte characters early in the file, and then an error later in the file.

I'm not sure the 1-line "fix" I mention is the right one, and it'll take someone who's much more expert than I am in JS and browser (and CodeMirror!) Unicode handling, to figure out if we have a robust enough fix. (And, I definitely do not have the cycles right now to teach myself the necessary background.)

jpolitz commented 3 years ago

I don't think it's surprising that this causes an error outside of strings, right? Pyret source is only intended to be ASCII, except for inside strings.

It's a bug if it doesn't work in string literals, though, because we proudly have lots of examples of multibyte characters in strings (https://www.pyret.org/docs/latest/strings.html), and they work.

FWIW the character in the original message doesn't even render on my machine on Github, so I can't reproduce by copy/pasting.

But the second example parses fine in a string:

So what's special about that first character? Do we know how it was generated? Can someone tell me what it is and what its code point(s) are?

blerner commented 3 years ago

I think it parses, but I'm still tentatively concerned that if a later expression (after that multibyte string) has an error, our srcloc info will be wrong by a character or so. I've always worried about this, and just quietly hoped it wouldn't ever come up ;-) Recognizing characters in strings and recognizing characters in comments are two slightly different codepaths in the lexer, since strings have escape sequences in them, and so the lexer needs to be more particular.

jswrenn commented 3 years ago

Can someone tell me what it is and what its code point(s) are?

CJK COMPATIBILITY IDEOGRAPH-FACF

brownplt / pyret-lang

Pyret is not fully Unicode-aware #1534