Enable incremental analysis for 3rd party analyzers in IDE

dotnet / roslyn

The Roslyn .NET compiler provides C# and Visual Basic languages with rich code analysis APIs.

https://docs.microsoft.com/dotnet/csharp/roslyn-sdk/

MIT License

18.89k stars 4.01k forks source link

Enable incremental analysis for 3rd party analyzers in IDE #1565

Open mavasani opened 9 years ago

mavasani commented 9 years ago

IDE driver has this heuristic to optimize analyzer execution for method body editing scenarios: http://source.roslyn.io/#Microsoft.CodeAnalysis.Features/Shared/Extensions/DiagnosticAnalyzerExtensions.cs,64

Essentially this flag causes IDE services to avoid re-executing certain analyzers on entire document, but instead restricts them to just re-execute on the method body. The assumption here being method body edits can only affect diagnostics in the method body, for the set of analyzers whose category suffices to reach the above code path (essentially analyzers which only have symbol/node/code block actions). However, our diagnostic analyzer API has no restriction on where an action can report diagnostics, so a node or code block action is free to report diagnostic in a method body of some other method. This essentially invalidates the IDE driver's heuristic for incremental method body analysis. Basically no analyzer can support semantic span analysis, but removing this will most likely regress method body editing typing performance.

I think we have 2 options to address this: 1) Introduce an internal marker interface, ISemanticSpanAnalyzer, in Features layer for our built in analyzers that support semantic span analysis and do semantic span analysis only for these explicit analyzers. All public analyzers are conservatively assumed to not support span analysis.

2) Tighten our diagnostic analyzer API: We can possibly tighten our analyzer API to restrict what actions can report diagnostics at what location. This should infact also help analyzer authors to write more performant analyzers even for command line case. For example, a code block action reporting diagnostic in other code block seems likely a badly written analyzer, and should ideally be done via some other action (a compilation end action).

To me, it makes complete sense to restrict an action to report a diagnostic only at a location within the input code unit. Symbol action => nodes within its declarations + body, Node action => descendant nodes, Code block action => nodes within code block, Tree/CompilationUnit action => syntax tree, Compilation (end) action => entire compilation. If an action needs to do analysis across code units (between methods, between types, etc.), then it should be implemented with an analyzer action that operates on code units that encompasses all dependent code units. And if that is not possible, all such dependent actions can be implemented via stateful CompilationStart/End action pair with intermediate actions doing state manipulation and end action reporting diagnostics.

I feel doing 2) by itself will be a good thing for analyzer authors and help them write more performant analyzers. As an added bonus, it will also help us implement IDE driver's semantic span analysis in a supported way, rather than a heuristic.

@JohnHamby @heejaechang @srivatsn thoughts?

heejaechang commented 9 years ago

@mavasani ya, I like this. but didn't @JohnHamby already said no?

mavasani commented 9 years ago

@heejaechang John has the most context about analyzer API semantics and he should make the final call.

The proposal here is to make couple of things about explicit the semantics of each kind of action to meet our logical assumptions:

Scope of reporting diagnostics
Scope of analysis: This is the one that I am not very clear about, but I feel we do need some restriction to distinguish between declaration diagnostics and code block or body diagnostics.

Below are my proposed additions to semantics of each analyzer action, feel free to correct them or provide better alternatives: (a) Symbol action: Provide declaration diagnostics about the symbol. Diagnostics can only be reported at a location within symbol's declaration node and should only depend on semantic version of the compilation, i.e. no executable code block edits within the entire compilation should affect diagnostics from these actions.

(b) Syntax node action: Provide semantic diagnostics about the given node. Diagnostics can only be reported at a location within the given node and should only depend on its descendant nodes, i.e. no edits outside the given node should affect diagnostics from these actions.

(c) Syntax tree action: Provide syntax diagnostics for nodes within this tree. Diagnostics can only be reported at a location within the tree and should only depend on syntax version of the tree, i.e. no edits outside the given tree should affect diagnostics from these actions.

(d) Compilation unit or semantic model action: Provide semantic diagnostics for nodes within this compilation unit/tree. Diagnostics can only be reported at a location within this tree and should only depend on syntax version of the tree + semantic version of the compilation, i.e. no executable code block edits outside the given tree should affect the diagnostics from these actions.

(e) Code block action: Provide executable code/method body diagnostics for the given code block. Diagnostics can only be reported at a location within the code block and should only depend on syntax version of the code block + semantic version of the compilation, i.e. no executable code block edits outside this code block should affect diagnostics from these actions.

(f) Compilation action: Provide declaration level compilation diagnostics for the given compilation, that don't need any stateful analysis. Diagnostics can be reported at any location within the compilation or have no location.

(g) Compilation end action: Provide declaration or executable code diagnostics for the given compilation, that need stateful analysis across the compilation. Diagnostics can be reported at any location within the compilation or have no location.

I would like to claim that without any restriction on scope of analysis or diagnostic reporting on actions, couple of things are guaranteed:

No analyzer host can ever correctly do any incremental analysis of diagnostics across compilations. Each edit to a compilation can invalidate all analyzer diagnostics.
Any analyzer written violating these restrictions can be written in a more performant way by abiding by these restriction using some other action.

@srivatsn points out that current API semantics make for simpler API usage and having these additional semantics can be done through additional metadata on analyzer than an analyzer author opts into indicating "I am compliant with these additional semantics".

msJohnHamby commented 9 years ago

An opt-in mechanism for promising bounded behavior is an interesting idea. Manish's rules defining behavioral bounds are well thought out, but there are problems.

For most actions that we have today, expecting bounded behavior by default is very restrictive. Consider a rule that a type should have a certain attribute if it is used as a type parameter for a generic method. By the rules above, an analyzer to detect violations of this must do it using a compilation end action. Were it to be done with a syntax node action, the action would be restricted to reporting the issue at call sites--which one might argue is OK--but the syntax node action would not be rerun if the type were edited to add or remove the attribute and so is not well behaved.

Enforcing reporting diagnostics within a boundary is easy. (I'd prefer to see exceptions for violations rather than have diagnostics not show up, but that's a separate discussion.)

Enforcing independence of semantics outside a boundary is impossible, at least for actions that are given access to symbols. Essentially any action that refers to a symbol has strayed outside the boundary.

To correctly implement incremental analysis, an engine needs to know the semantic dependencies of an action. It is possible to collect these, but it requires every property access to/through a symbol or other meaningful structure to track dependencies. We have no chance of implementing such a mechanism correctly before RTM.

Syntax tree actions don't have access to symbols and so are intrinsically bounded. We could add syntax node actions that don't get a semantic model, and those would also be bounded. We could add symbol actions that get symbols with limited perspective (e.g. type symbols would have members available, but referenced type symbols would not provide much beyond names), and those would be bounded. We could add semantic model actions that get models with limited perspective, and those would be bounded. Bounded syntax node actions could be useful for analyzers concerned with syntactic matters (and would be easy to add), and bounded symbol actions could also have utility (but are more difficult to add, because we need to add incomplete symbols).

Getting correct incremental behavior for actions that have access to complete symbols is outside the realm of possibility for RTM.

mavasani commented 9 years ago

Thanks @JohnHamby, some comments below about your feedback:

1) "Consider a rule that a type should have a certain attribute if it is used as a type parameter for a generic method. By the rules above, an analyzer to detect violations of this must do it using a compilation end action." This diagnostic can be (and should be) implemented via a symbol action as it is a declaration diagnostic for the symbol. This diagnostic is obviously reported on the symbol name so can abide by the location restriction. It can also satisfy the second requirement I proposed for symbol actions: "should only depend on semantic version of the compilation, i.e. no executable code block edits within the entire compilation should affect diagnostics from these actions". Any edits within executable code blocks or method bodies in any source file in the compilation shouldn't change compilation's symbol table, and hence original diagnostic is not invalidated.

2) "Enforcing reporting diagnostics within a boundary is easy." +1 on this.

3) "Enforcing independence of semantics outside a boundary is impossible" Completely agree on this as well. Primarily the additional semantics should serve as a guideline for writing performant analyzers, but if an analyzer can completely abide by it, and declare so then it gets the added bonus of being incremental in IDE like environments. We can't enforce this through code, it just has to be an explicit contract the analyzer declares it obeys, so that when executed in an IDE-like analyzer host environment that works across compilations, it allows the analyzer driver to re-use its results from execution against prior compilations, improving it's session wide performance.

4) "To correctly implement incremental analysis, an engine needs to know the semantic dependencies of an action. It is possible to collect these, but it requires every property access to/through a symbol or other meaningful structure to track dependencies. We have no chance of implementing such a mechanism correctly before RTM." Agreed, our current diagnostic engine has a heuristic to determine which analyzers can be incrementally used for method body editing scenario, it uses kind of actions registered by the analyzer. This obviously is not 100% correct, and can never be unless we clearly define what these semantics are and how an analyzer can mark itself as incremental.

5) In your second last paragraph, you discuss how we can enforce the semantic restrictions through passing a more stringent or restrictive input context object. That is indeed going to be a hard problem, but great if it can be implemented. Until then we can rely on just documenting the additional semantics, and rely on analyzer author itself declaring whether or not it obeys them.

After discussing this further with @Srivatsn, we got a feeling that this work item, along with adding capability to analyzers to mark themselves as thread safe for concurrent action execution, falls into post RTM bucket of "enabling analyzers to be written in a more performant way for command line compiler and IDE environment". Let me modify the title of this issue to reflect this, and use this story to track this work.

heejaechang commented 9 years ago

I am not sure whether I understand everything but what @mavasani proposed should work except one case such as changing usage requires marking declaration as error. but those cases, I am not sure how this can be done except using compilation end action (stateful analyzer) otherwise it might report multiple diagnostics for same root cause (location)

...

by the way, for something like unused using, what would be best way to implement it? one can't use syntax action, with semantic action, author need to walk tree themselves to find out all usings used, syntax node action violate the restriction.

mavasani commented 9 years ago

@heejaechang For the first case "case such as changing usage requires marking declaration as error" I wouldn't call it a true declaration diagnostic, but instead something that just happens to have a location on the declaration as that is the best place to put it. The diagnostic is about references of a declaration across a compilation, hence logically a compilation wide analysis diagnostic. I think it would be functionally correct and most performant if implement through a compilation end action.

For the second case: I think correct implementation of unused using analyzer is correct, it uses semantic model action: http://source.roslyn.io/#Microsoft.CodeAnalysis.Features/Diagnostics/Analyzers/RemoveUnnecessaryImportsDiagnosticAnalyzerBase.cs,66

It avoids perf cost from double binding as it asks semantic model for unused diagnostics (which are cached by compilation): http://source.roslyn.io/#Microsoft.CodeAnalysis.CSharp.Features/RemoveUnnecessaryImports/CSharpRemoveUnnecessaryImportsService.cs,24 It also needs to be full tree level diagnostic as any edit in the tree can invalidate existing diagnostic. I think it can meet this requirement: "Compilation unit or semantic model action: Provide semantic diagnostics for nodes within this compilation unit/tree. Diagnostics can only be reported at a location within this tree and should only depend on syntax version of the tree + semantic version of the compilation, i.e. no executable code block edits outside the given tree should affect the diagnostics from these actions."

heejaechang commented 9 years ago

@mavasani I am not saying it can't be done. what you described above is exactly what v1 was doing before moving to DiagnosticAnalyzer. and it even had flag to indicate them and it even forced them (by filtering though rather than exception) I will be happy if we can get those ability back.

...

For the first case, what I was trying to say was what you proposed should work. and the case where one might say it has an issue should be implemented in compilation end action at the first place, so it shouldn't be an issue.

the second case, I know how currently it is done. we kind of cheat since compiler actually does the work for us and return it as diagnostic. so we just use semantic action which has right scope to report it. what I was asking is, if compiler didn't do it for us, how one can implement it through our API.

mavasani commented 9 years ago

@heejaechang Regarding the explicit flag for "SupportsSpanBasedAnalysis" or incremental analysis. Yes, we need the analyzer author to be declarative about those things, but probably it is better to have a higher level flag/attribute about the way analyzer has been implemented rather than do it through some specific flag that talks about "span" or "incremental analysis". We also need to make this optional, so only analyzer authors that sincerely care about analyzer performance need to dive deeper into the additional perf related API semantics. John is the best person to make a call on this though.

Regarding the second case, I agree, unused usings is probably not representative of all true compilation unit analyzers. I remember we had talked about a RegisterCompilationUnitStartAction and RegisterCompilationUnitEndAction, but in command line builds we don't bind everything within a syntax tree at the same time, so holding onto a semantic model that caches all bound nodes for the tree from start action till end action, might make us hold it for entire compilation lifetime. Design team decided such analyzers should be rare and discouraged. Users should instead use compilation end analyzers for such diagnostic.

mavasani commented 8 years ago

Update: IDE analyzer driver v2 added in 1.1 completely removes the heuristic for all 3rd party analyzers - there is no incremental analysis for non-BuiltInAnalyzers in the IDE