dotnet / roslyn

The Roslyn .NET compiler provides C# and Visual Basic languages with rich code analysis APIs.
https://docs.microsoft.com/dotnet/csharp/roslyn-sdk/
MIT License
18.95k stars 4.02k forks source link

Logical locations for diagnostics #11345

Open nguerrera opened 8 years ago

nguerrera commented 8 years ago

The SARIF format provides optional allows analysis results to be associated with "logical locations" rather than just the "physical locations" represented by path + line and column ranges. A logical location identifies the target module/namespace/type/member and how it sits in the hierarchy.

During offline review of #11233, it was suggested that we also log "logical locations" in the /errorlog. However, AFAICT, Diagnostic objects do not carry enough information to implement that.

It is interesting that most of the diagnostics raised by dotnet/roslyn-analyzers go through helpers that take symbols and syntax node targets: https://github.com/dotnet/roslyn-analyzers/blob/master/src/Analyzer.Utilities/Extensions/DiagnosticExtensions.cs. However, these helpers currently have no way to permanently associate these nodes with the diagnostics they create.

~I think that these node -> diagnostic helpers should be part of the core and Diagnostic or Location should be amended to retain the target nodes for subsequent logging.~

Note that this there is a related, major issue in the IL to IOperation binary analysis. For example, we currently receive diagnostics against IL that, for example. say only "Avoid unused field" with no indication of which field is unused.

@lgolding @michaelcfanning @dotnet/roslyn-analysis

mavasani commented 8 years ago

Diagnostic or Location should be amended to retain the target nodes for subsequent logging

This cannot happen as we don't want to hold onto the syntax nodes in memory while still holding onto diagnostics/locations, which just refer to the tree and span.

nguerrera commented 8 years ago

@mavasani Interesting. Syntax nodes are too fine-grained for the scenarios here anyhow. To what extent would keeping only the parent ISymbol around be detrimental? Do we have scenarios where Diagnostic objects outlive them?

mavasani commented 8 years ago

Yes, IDE creates and holds onto diagnostic objects across compilations, so holding onto symbols will leak compilations. Additionally, we also serialize and deserialize diagnostics in IDE using Esent storage (that is how we made error list scalable to hold million of errors), so diagnostic must stay serializable. @heejaechang can give you more context, but I don't think we can hold onto any node or symbol data in diagnostics.

heejaechang commented 8 years ago

@nguerrera any compilation objects (anything that is from roslyn) shouldn't live out side of analyzer. basically all roslyn objects should be let go once CompilationEndAction is called if it is stateful analyzer, or only during each Actions if it is stateless analyzer.

also, I dont think current Diagnostic implementation let you stick any object to it. property bag is strictly set to (string, string) map to prevent any such wrong doing.

...

diagnostic object is supposed to be a cheap currency engine can decide how long it will hold onto, how many it will hold onto and how to put them aside when not needed.

any roslyn compilation objects are considered heavy currency which will consume tens or hundreds of megabytes in memory which should short live as much as possible. especially since many of those objects are interconnected to full graph. (doesn't matter what is held onto, it will make whole graph to a live in memory) - which make it easier for API consumer to use the graph since it has all parent pointers and all. but heavy in a sense of memory usage.

so expected pattern is analyzer uses raw data (roslyn graph) to create meaningful information (diagnostics) which is compared to roslyn objects really cheap.

...

I think what you want is this. http://source.roslyn.io/#Microsoft.CodeAnalysis.Workspaces/SymbolId/SymbolId.cs,42

basically, you need to write a version of this public static string CreateId(ISymbol symbol)

but doesnt take ISymbol but something from IL that let you construct string that can be used with this public static ImmutableArray GetSymbolsForId(string id, Compilation compilation)

this will resolve the id to a symbol in the given compilation (snapshot), once you have symbol that belong to specific snapshot of compilation, you should be able to get list of locations for that snapshot for the symbol.

heejaechang commented 8 years ago

by the way, I believe we shouldn't stick this logical location information for live diagnostics except when needed (such as saving to serif format), creating those information is expensive and doing that all the time just in case the diagnostic might get saved to serif will be a unnecessary big perf impact for live analysis.

nguerrera commented 8 years ago

Thanks for all of the pointers and context, @heejaechang

So looking at what a logical location actually needs to contain, I don't think I actually need to get back to a symbol. We can likely serialize something roughly as compact as SymbolId that gives us enough to log the logical locations.

I agree making this ID generation happen only when necessary would be a big win. To do so, we'd need some way for whatever is driving the analysis to make the call. I think that would mean something like reportDiagnostic callback getting a target symbol from analyzer and deciding whether to gen an ID and stash it with the diagnostic or just ignore it because there is no scenario in current context. Am I missing a better way?

mavasani commented 8 years ago

I think that would mean something like reportDiagnostic callback getting a target symbol from analyzer and deciding whether to gen an ID and stash it with the diagnostic

Unfortunately, the analyzer driver cannot do this. A diagnostic reported by analyzer must roundtrip to the code fix, and the code fix provider might have assumptions on properties stashed onto the diagnostic it receives (given that the diagnostic was reported by an analyzer implemented alongside it) - we cannot stash any extra information to the diagnostic.

I think the best approach is for the error logger to walk the diagnostics (as generated right now), and then map the diagnostic location (if in source) back to a node/symbol - this is how the source suppression works for determining if reported analyzer diagnostics should be suppressed in source (see here)

nguerrera commented 8 years ago

@mavasani Thanks. Given everything I've now learned, it does seem like the most promising approach. It still leaves me stuck in the binary analysis case and I was imagining there might be a two birds with one stone solution lurking...

I just noticed that there is already the (theoretical) possibility of a symbol tree being held on to by a diagnostic and getting lost in the serialize/deserialize round-tree. If Diagnostic.Location is a MetadataLocation. The docs on Location mention this:

/// <summary>
/// Returns the metadata module the location is associated with or <c>null</c> if the 
/// module is not available.
/// </summary>
/// <remarks>
/// Might return null even if <see cref="IsInMetadata"/> returns true. The module symbol
/// might not be available anymore, for example, if the location is serialized and deserialized.
/// </remarks>

Perhaps, we can backwards from source diagnostics exactly as you suggest, but still augment MetadataLocation with the actual descendant target symbol (and equally not serialize it)? (Diagnostics with metadata locations won't happen in practice in existing scenarios, but would be the norm for binary analysis...)

nguerrera commented 8 years ago

(Also, note that if there were a reportDiagnostic(Diagnostic, ISymbol), it would not need to imply that the ISymbol comes back to the fixer -- if/how/where the driver associates that symbol with the diagnostic would be a driver implementation detail. I'm still liking where the previous comment thread is going better, but still to wanted to clarify that point.)

heejaechang commented 8 years ago

@nguerrera currently, as soon as diagnostic is reported, any diagnostic not belong to source will be dropped to floor in IDE. compiler might not care since everything is in memory anyway per project while doing compiling, analyzing, emitting.

heejaechang commented 8 years ago

one more thing. in IDE, as soon as it gets diagnostic from compiler, it converts diagnostic to IDE own internal type (DiagnosticData) which will drop anything compiler's internal implementation hold onto.

so, even if IDE is changed to support diagnostic for metadata, our internal type for diagnostic will probably never hold onto symbol.

nguerrera commented 8 years ago

@heejaechang That seems fine.

Here's what I was getting at -- we'd have two distinct ways to handle logical location:

  1. For binary analysis: we enhance MetadataLocation to have not just the outer IModuleSymbol but also the ISymbol that's closest. (We still only need to back it by one symbol as we can always get ContainingModule). IDE can continue to drop these diagnostics. (And if the compiler drops these now, it can continue to drop them as well as the binary analyzer will remain a separate tool.)
  2. For source analysis/compilation: we do as @mavasani suggests and map the source location back to a logical entity in the /errolog implementation.

Basically, this changes nothing with respect to what is carried by source diagnostics and follows the existing precedent that MetadataLocation can carry a symbol that is not guaranteed to be kept around.

heejaechang commented 8 years ago

@nguerrera one question, so binary analyzer can report issues only at declaration level?

nguerrera commented 8 years ago

@heejaechang That would be a step forward in the prototype so far (where I have no locations yet ;)),

We'll ultimately want to map IOperations to IL offset ranges. That could be hung off of non-source location too.

We can log the offset in the binary analysis SARIF, but also map it back to source location if we have PDBs. PDB approach can't be only option, though, because (1) we don't necessarily have a PDB and (2) PDBs don't currently contain location of all declarations.