dotnet / roslyn

The Roslyn .NET compiler provides C# and Visual Basic languages with rich code analysis APIs.
https://docs.microsoft.com/dotnet/csharp/roslyn-sdk/
MIT License

Consider unifying scanning and parsing methods in `LanguageParser` #53976

Open · cston opened this issue 3 years ago

There are a number of method pairs in `LanguageParser` that scan for a possible syntax construct and, if the scan succeeds, parse that construct.

Examples:

Having pairs of methods increases duplication and increases the chance that the two fall out of sync; see https://github.com/dotnet/roslyn/issues/53973 for an example.

Consider unifying each pair into a single TryParse() method instead.
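A condensed, hypothetical illustration of the duplicated shape (these bodies are invented for this sketch, not actual `LanguageParser` code): the scan method must mirror the parse method's token-level decisions, so a rule added to one but not the other causes silent drift.

```csharp
// Hypothetical pair, heavily condensed; the real methods are long,
// which is what makes the duplication easy to get out of sync.
private bool ScanTypeArgumentList()
{
    if (this.CurrentToken.Kind != SyntaxKind.LessThanToken)
        return false;

    // Walks the same token sequence ParseTypeArgumentList walks,
    // but builds nothing and only reports success or failure.
    return this.ScanTypeArguments();                    // hypothetical helper
}

private TypeArgumentListSyntax ParseTypeArgumentList()
{
    var lessThan = this.EatToken(SyntaxKind.LessThanToken);

    // Re-makes the decisions ScanTypeArgumentList already made,
    // this time allocating nodes. A rule added here but not above
    // is exactly the class of bug cited in #53973.
    return this.BuildTypeArgumentList(lessThan);        // hypothetical helper
}
```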

cston commented 3 years ago

cc @333fred, @CyrusNajmabadi

CyrusNajmabadi commented 3 years ago

Yes, I fully agree. Though I definitely think this should be done in small pieces (i.e. one pair at a time), along with a perf harness that shows this doesn't cause an excessive regression.**

In practice, I've found that just having the TryParseXXX form is sufficient. Though in the rare case that it causes a perf hit due to allocations that are thrown away, a really easy, effective, and maintainable solution is to then have the pair of:

- `TryParseXXX`: can parse, but has a rewind point to undo things if the speculative lookahead proves the construct was not valid.
- `IsDefinitelyNotXXX`: extremely quick and simple scanning that matches common cases we know could definitely never succeed in `TryParseXXX`.

This pattern can be used to ensure that, for example, we're not constantly trying to parse a lambda when we see something like `id,` or `id)` or `id+`, etc. The `IsDefinitelyNotXXX` function should be driven by traces and by looking at the set of true, costly negatives we hit in real-world code with `TryParseXXX`. Sometimes it's possible to add just a tiny handful of cases that drops the number of times we need to parse by a high XX% overall.
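A hedged sketch of that two-method shape; the member names here are illustrative, not actual `LanguageParser` members, though the reset-point calls are modeled on the parser's internal helpers.

```csharp
private LambdaExpressionSyntax? TryParseLambdaExpression()
{
    // Cheap token-level quick reject: bail before paying for a
    // speculative parse in the common, provably-negative cases.
    if (this.IsDefinitelyNotLambdaExpression())
        return null;

    var resetPoint = this.GetResetPoint();
    var lambda = this.ParseLambdaExpressionCore();      // hypothetical
    if (lambda == null)
        this.Reset(ref resetPoint);                     // undo the speculation
    this.Release(ref resetPoint);
    return lambda;
}

private bool IsDefinitelyNotLambdaExpression()
{
    // Match only cases known (from traces) to be frequent and to never
    // start a lambda, e.g. `id,` `id)` `id+`; anything unclear falls
    // through to the full TryParse.
    if (this.CurrentToken.Kind != SyntaxKind.IdentifierToken)
        return false;

    var next = this.PeekToken(1).Kind;
    return next is SyntaxKind.CommaToken
               or SyntaxKind.CloseParenToken
               or SyntaxKind.PlusToken;
}
```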

--

**

I do think we should allow ourselves some amount of regression here. For example, if compiling got 2% slower, I'd personally be OK with this :)

cston commented 3 years ago

See also #53827 (comment) for an alternative where `IsPossibleLambdaExpression()` could be simplified directly.

alrz commented 2 years ago

I think this is a must-have for parsing collection literals, which completely overlap with attribute lists.

To avoid thrown-away allocations, I think an intermediate step could yield raw nodes that are reusable when converting to either an attribute or an expression.

```csharp
var x = [ident, ident(ident)];      // expression
var x = [ident, ident(ident)] void  // attribute
```

In the first pass, all tokens are gathered while it's still possible to parse them as either node. A specific wrapper is created at the point a disambiguating token is encountered; it's also possible such a token is seen earlier, where the input definitely cannot be parsed as the other node.
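A very rough sketch of that idea, with everything hypothetical and simplified well past what the real green-node factories require: parse the bracketed contents once into shared elements, then wrap them based on the disambiguating token.

```csharp
private CSharpSyntaxNode ParseBracketedConstruct()
{
    var openBracket = this.EatToken(SyntaxKind.OpenBracketToken);

    // First pass: parse the comma-separated contents once, into raw
    // nodes that remain valid under either interpretation.
    var elements = this.ParseBracketedElements();       // hypothetical

    var closeBracket = this.EatToken(SyntaxKind.CloseBracketToken);

    // The token after `]` disambiguates: a type keyword like `void`
    // means the brackets were an attribute list on a member;
    // otherwise treat them as a collection literal.
    return this.CurrentToken.Kind == SyntaxKind.VoidKeyword
        ? this.WrapAsAttributeList(openBracket, elements, closeBracket)      // hypothetical
        : this.WrapAsCollectionLiteral(openBracket, elements, closeBracket); // hypothetical
}
```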

@CyrusNajmabadi Is that worth it at all?

CyrusNajmabadi commented 2 years ago

> To avoid thrown-away allocations,

I don't think this is actually important. Generally speaking, it's going to be a small amount of allocation in rare scenarios. E.g. in the above, it will be much more likely to be a collection literal, so just start with that, and rewind if it turns out not to be one.

The important thing is stopping the duplication of the Scan/Parse logic. I think it's totally fine to speculatively parse and then rewind if the speculation is incorrect.
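A minimal sketch of that "start with the likely interpretation and rewind" approach, assuming the parser's reset-point helpers; the TryParse/Parse names are placeholders, not actual `LanguageParser` members.

```csharp
private CSharpSyntaxNode ParseBracketedConstruct()
{
    var resetPoint = this.GetResetPoint();

    // Collection literals are the common case, so speculate on that
    // first; the thrown-away allocations on failure are rare.
    var collection = this.TryParseCollectionLiteral();  // hypothetical
    if (collection != null)
    {
        this.Release(ref resetPoint);
        return collection;
    }

    this.Reset(ref resetPoint);                         // speculation failed: rewind
    this.Release(ref resetPoint);
    return this.ParseAttributeListAndMember();          // hypothetical
}
```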