dotnet / roslyn

The Roslyn .NET compiler provides C# and Visual Basic languages with rich code analysis APIs.
https://docs.microsoft.com/dotnet/csharp/roslyn-sdk/
MIT License
19.01k stars 4.03k forks source link

VB parsing string-interpolation in erroneous code has poor error recovery #3617

Open ljw1004 opened 9 years ago

ljw1004 commented 9 years ago

Watch this video: string-interpolation-parsing

I get into this situation often when I'm copying+pasting code around, or breaking a method up into smaller methods by deleting "sub" and then re-adding a bunch of "sub" and "end sub" statements where appropriate.

WHAT I OBSERVE: When there's a valid statement, regular strings and interpolated string parse fine.

When there's an invalid statement, regular strings parse fine, but interpolated strings don't recognize their terminating quote, so the remainder of the entire file gets "stringed out".

WHAT I EXPECT: interpolated strings should believe themselves terminated at the terminating quote, just like regular strings do.

ljw1004 commented 9 years ago

Paging @AnthonyDGreen

dpoeschl commented 9 years ago

Everything from "console" to the end of the file is parsed as SkippedTokensTrivia on an EmptyStatement.

AnthonyDGreen commented 9 years ago

Looks like a reasonable suggestion. We've talked for a while about coming up with some consistent rules for VB error recovery and I think this is good input for that discussion.

AdamSpeight2008 commented 9 years ago

Is it actually being parsed correctly?

ljw1004 commented 9 years ago

I know it sounds silly, but I'm suffering quite badly from this. Here's another example. I accidentally missed a closing parenthesis while editing this line, and the entire rest of my program gets "stringed out." vb-bad-parse

I hope this particular case of string interpolation can be addressed urgently as an Update1 issue, rather than postponed to be part of some greater holistic whole.

AdamSpeight2008 commented 9 years ago

It should be biased toward the closing partner of syntax already parsed but not yet closed. eg

 ( " {    (    }" )

 | | |    |    || |
 | | |    +--? || |
 | | +---------+| |
 | |            | |
 | +------------+ |
 |                |
 +----------------+
gafter commented 9 years ago

@AnthonyDGreen I don't see this in recent bits from master. Is it fixed in stabilization?

dpoeschl commented 9 years ago

@gafter @AnthonyDGreen It repros for me against master. The exact code I'm trying is:

Module Module1
    console.writeline($"a")
    Sub Main()
    End Sub
End Module

and

Module Module1
    Sub Main()
        Dim x = 0.5
        Console.Write($"{Math.Abs(x}")

        ' Comments
    End Sub

    Sub M()
        ' This is all still part of the string
    End Sub
End Module
AdamSpeight2008 commented 9 years ago

Is the issue at the parsing level, or is it in the symbol / syntactic level?

AnthonyDGreen commented 9 years ago

@AdamSpeight2008, I'm not sure what you mean by symbol/syntactic - do you mean lexical (at the token level)?

To be clear the issue is not that there is a bug anywhere. The parser is written to parse incomplete code in a particular way and we're discussing changing that way based on a refined view on what the user probably intended in this particular circumstance. You'll see the same behavior here with XML literals:

Module Module1
    Sub Main()
        Dim x
        Console.WriteLine(<xml><%= Math.Abs(x %></xml>)
    End Sub
End Module

Because of the missing closing parenthesis the XML is never considered closed and consumes all text until the end of the file. What's happening in this case is that like XML literals an interpolated string is a special embedded language in VB. Normally in VB a double quote character can only begin a string-literal-token. But, when the scanner encounters a dollar sign and double quote character back to back in VB's normal expression parsing context it creates a dollar-sign-double-quote token-which triggers the parser to enter a special mode just for interpolated strings. In this mode a double quote character should produce a double-quote-token instead and terminate the string. Like in XML, the interpolations themselves cause an interleaving of scanner modes where the parser has to switch back and forth between standard VB expression rules and interpolated string rules. Recovering at the boundary is the issue. When the parser parses x it looks ahead for something that could continue an expression plus-token (or any other binary operator). In looking ahead it causes the scanner to read the program at that point using VB standard rules. It sees a double quote character, assumes it's starting a string, and reads to the end of the string (which it doesn't find), attaches the malformed string as garbage, and then the inner invocation expression, the interpolation, the interpolated string itself, the outer invocation, the sub, and the module all look for the valid tokens/statements to close themselves but instead find the end of file and so they all parse as malformed. This is what it's designed to do, right now. The case where things are broken outside of method bodies is because we don't parse things as executable statements outside of method bodies. There are slightly different rules. For example, the open < that starts an attribute or the : after an attribute target specifier are different than if they were just inside of a method body. Dim x is a field and not a local declaration. We need to come up with a different approach.

Interestingly, the problem is exacerbated by multi-line string literals. In prior versions string lexing would give up at a newline, so the corruption would end on that line and at least the sub and module could close normally. I guess if we made error recovery more lazy so that instead of greedily consuming garbage tokens it left them for its parents to handle we'd be in a better shape. But that could cause other scenarios to become worse - we'll need to think it through.

Thought through a bit. Take this code as an example:

Module Module1
    Sub Main()
        Console.WriteLine($"{Math.Abs(1 2, 3)}")
    End Sub
End Module

In this case when ParseExpression gets 1 it looks ahead for valid continuation, finds a 2, greedily consumes it as garbage and resyncs to right before the comma - this leaves things in a great state for ParseArgumentList because it has skipped the 2 and continues parsing the arg list correctly, the intepolation, interpolated string, outer invocation, sub, and module are all now well-formed. If the recovery were lazy and didn't skip the 2 everyone would fail and the closing quote would still get tokenized as a string start. Now I guess if we made a special recovery scanning mode it at least would always tokenized skipped strings as single-line - then the corruption would be limited to the line, which is probably for the best. It's at least much better than what we're doing now.

Alternately, we could maintain a stack of pending scanner states and when resyncing (skipping over tokens to get back to a good place) we could try each state in tern to see if we see any tokens that any of our parents could continue or close on in LILO priority order. So in Lucian's second example after grabbing the x it would peak ahead in standard VB mode and see a }, recognize that it could't do anything with the } and try rescanning in interpolated string mode still get a } and see that it could recover and bail, the invocation would repeat this and get the same answer and ParseInterpolation would recover fine and so the closing quote would scan correctly. That's a bit more state to track though, a much more invasive solution.

There might be a scenario where a parent further out might be a better place to resync than inside? Like does the inner array literal or the outer interpolation get the }? I don't know - but that's a different problem than what we're discussing and I'm willing to cross that bridge when we get there.

+@JohnHamby who understands the madness (the concept of the compiler knowing when to give up). +@cston with whom I discussed recovery options for XML literals in similar situations for countless hours. +@VSadov who helped write the VB scanner and will try to talk us out of my crazy schemes.

ljw1004 commented 9 years ago

Anthony, from what you've said, this is a regression bug. VS2013 had great error recovery.

vs2013

vs2015

AnthonyDGreen commented 9 years ago

Regression! You've said the magic word!

Though I think the interpolated string issues you describe are more common, likely, a regression is also quite serious. Luckily, I think we'll be able to come up with a solution that heals two birds with one song. Great follow-up!

AdamSpeight2008 commented 9 years ago

@AnthonyDGreen From your description then I mean Lexical, the stage after parsing raw characters.

Also regrading the Regression, is that VS2013 with / without Roslyn?

jbparker commented 9 years ago

Just want to add support for this change / fix as well. Interpolated strings are a fantastic feature of the language, but this is causing some challenges on our team internally since we've fanatically adopted them in favor of String.Format. Would also love to see this addressed on Update 1 / optional patch.

jaredpar commented 8 years ago

Seems related to #3771