Remove zero-width space characters when pasting text into CPO.

asolove commented 5 months ago

Resolves #508.

Testing

I constructed a sample paste payload that included a valid function definition plus some ZWS characters. Here is what I got:

Before:

After:

Discussion

I think this change is net positive right away, but there's a larger possible discussion here:

There are a lot of formatting-only characters in Unicode, and eventually we'll run into others we also want to strip out.
But, those characters have meaning, or are even required, when used in human-readable text, so a smarter version of this would ideally leave them in comments and string literals that appear in the source. (Of course, that would require being able to parse the program text, which our parser currently can't do when these characters are included, so ...)

jpolitz commented 5 months ago

Re: parsing, the CodeMirror mode is often a good “fuzzy” parser that can be used for some of these cases. Anchor has a nice function for detecting if a block is complete when pressing enter by checking the mode's state. I think that checking getTokenAt and trusting CM's answer on if it's in a string or not could be a reasonable way to handle this. Since it's already on a syntactically-invalid file some fuzziness is fine.

For example: https://github.com/brownplt/pyret-lang/blob/anchor/ide/src/utils.ts#L79 (without reconstructing all the details in a comment, this is trying to determine if the current cursor position is e.g. right after an end that is fully outdented.)

jpolitz commented 5 months ago

(I guess in particular checking for being in a string is a tokenization problem rather than a parsing problem, and the CodeMirror mode is an excellent fuzzy tokenizer)

blerner commented 5 months ago

You can also run the actual Pyret tokenizer pretty easily, and if it produces an UNKNOWN token we should probably strip that character out...

schanzer commented 5 months ago

Relevance-through-adjacency: CodeMirror 6 has a much more powerful parser, which produces actual ASTs (complete with an API!). These ASTs are also fuzzy, but with the extra context available we may be able to make smarter decisions about what to do.

(It also opens up more options for ways to style text, provide hints, etc)

blerner commented 5 months ago

It also opens up incompatibilities with Pyret's actual parser and lexer, since we'd now have two grammars, two lexers and two parsers, etc. I've been hesitant about adding CM6's parsing for a while, for maintenance reasons. I don't know that Pyret's grammar is unambiguous enough for TreeSitter to handle. I do know that Pyret's delimiter syntax is unambiguous enough for our indenter to handle it, and I don't know that CM6's parse-error recovery matches our current behavior. I agree that having a better parser would be nice for better highlighting (it's a longstanding annoyance that type annotations don't get highlighted consistently/correctly), but it's a large code shift.

blerner commented 5 months ago

(There's also a technical compatibility issue I was never able to figure out -- Pyret's tokenization is ever so slightly whitespace-dependent: ( is tokenized two ways, depending on if it's preceded by whitespace, so that we can parse f (3 + 4) and f(3 + 4) differently, so that we can warn that the former is ill-formed with two expressions on one line. I don't think I can accomplish the same thing with CM6 and TreeSitter.)

schanzer commented 4 months ago

@blerner I spent some time last night reading Marijn's blog post about lezer, and poking around the Tree-Sitter website. I understood maybe 5% of what's there, but it does seem like Tree Sitter can handle an extraordinary number of language grammars.

The section on error-recovery is especially interesting, and Marijn talks a lot about using tricks from GLR parsing to deal with ambiguity, recovery, and more. I dimly recall you talking about GLR as well, years ago.

I know you've got far higher priorities than engaging in hypotheticals, but at some point I'd love to have you dive into this a bit and see if it's even possible.

schanzer commented 4 months ago

@blerner and @jpolitz anything we can do to get this PR merged sooner rather than later? There's an active need for it from the Algebra 2 crew

blerner commented 4 months ago

I think this could go in as-is, it's a low risk, minimal change.

schanzer commented 4 months ago

@jpolitz ane @blerner poking again about getting this merged and up on CPO

brownplt / code.pyret.org

Remove zero-width space characters when pasting text into CPO. #512

Testing

Discussion