gundermanc commented 1 year ago

As a consumer of the Roslyn LSIF generator tool in an automated environment, I have little visibility into when or why it fails in aggregate. Currently, when a failure is encountered, we must open a specific job, repro the issue, and then either sift through the logs or debug the tool. This process is time-consuming and makes it hard to triage and act on issues.

I propose that the Roslyn LSIF generator tool be updated to support the following 'standard error protocol' (note that the protocol could be written to a pipe or other output stream).

Protocol Spec

The stream is divided into lines, separated by \r, \r\n, or \n. Each line is treated as a separate command.
Commands are parsed by taking a line in its entirety and parsing it with a standard JSON parser.
Empty lines are ignored.
Non-JSON lines are allowed but are re-dispatched as 'log' commands with 'error' severity.
'Telemetry' can be passed with any command. This is a set of string key => { string or double } value properties that are Roslyn or language specific and may be aggregated for diagnostic purposes.
Each command follows the JSON schema:

{
  "command": "log",
  "parameters": {
    // One or more command specific parameters.
  },
  "telemetry": {
    // Any arbitrary measurements or health metrics that should be reported.
  }
}
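The parsing rules above can be sketched from the consumer side as follows. This is a minimal illustration in Python, not a proposed implementation. The function name `parse_stream` and the `message` parameter used to carry the text of a non-JSON line are assumptions (the spec does not say where that text should go), as is the choice to treat whitespace-only lines as empty and to re-dispatch JSON values that are not command objects.

```python
import json
import re

# The three separators named in the spec. "\r\n" is listed first so that it
# is consumed as one separator rather than as "\r" followed by "\n".
LINE_SEPARATOR = re.compile(r"\r\n|\r|\n")


def parse_stream(text):
    """Turn raw stream text into a list of command dictionaries."""
    commands = []
    for line in LINE_SEPARATOR.split(text):
        if not line.strip():
            # Empty lines are ignored (whitespace-only lines too: assumption).
            continue
        try:
            parsed = json.loads(line)
        except json.JSONDecodeError:
            parsed = None
        if isinstance(parsed, dict) and "command" in parsed:
            commands.append(parsed)
        else:
            # Non-JSON line: re-dispatch as a 'log' command with error severity.
            # The "message" key is hypothetical; the spec does not name one.
            commands.append({
                "command": "log",
                "parameters": {"severity": "Error", "message": line},
                "telemetry": {},
            })
    return commands
```

Note that a strict JSON parser such as Python's `json` module rejects `//` comments, so the comments in the schema above are explanatory only and could not appear in a real command line.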

Log command

Initially, log is the only supported command. Here is an example:

{
  "command": "log",
  "parameters": {
    "severity": "Error",
    "exception": "System.InvalidOperationException",
    "callstack": "at Foo.Bar() line 350...at Program.Main() line 15",
    "code": "CS1501"
  },
  "telemetry": {
    "Roslyn.LsifGenerator.IsDone": false,
    "Roslyn.LsifGenerator.PercentageDone": 50
  }
}

Log supports the following parameters:

severity: This is case sensitive and can be "Info", "Warning", or "Error".
exception: This is a language specific exception or error code name.
code: This is a language or tool specific error code or failure name. This is used as a bucketing parameter for diagnosis and can be populated in whatever way makes sense for Roslyn.
callstack: This is the language specific call stack, stack trace, backtrace, or other diagnostic info.

Note that no explicit limits are placed on the length of these parameters, but the consumer may truncate any that exceed a few thousand characters.
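As a rough illustration of how a consumer might apply these rules, the Python sketch below checks the case-sensitive severity value and truncates over-long string parameters. The helper names and the 4000-character limit are assumptions; the proposal only says "a few thousand characters" and does not say what should happen to a command with an unrecognized severity.

```python
SEVERITIES = ("Info", "Warning", "Error")
MAX_PARAMETER_LENGTH = 4000  # assumed; the proposal only says "a few thousand"


def check_severity(parameters):
    """Return True when severity is one of the allowed, case-sensitive values."""
    return parameters.get("severity") in SEVERITIES


def truncate_parameters(parameters, limit=MAX_PARAMETER_LENGTH):
    """Return a copy with over-long string values cut to `limit` characters."""
    return {
        key: value[:limit] if isinstance(value, str) else value
        for key, value in parameters.items()
    }
```

With these helpers, `check_severity({"severity": "error"})` is false because the comparison is case sensitive, and a 5000-character `callstack` would come back cut to 4000 characters.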

Other commands

Currently none, though this is left as an open-ended protocol in case we need a way to facilitate future LSIF tool => consumer communication for diagnostic purposes, such as:

Reporting or triggering a fail fast.
Triggering collection of additional diagnostics or machine scoped configuration settings.
Measuring duration of operations.
Triggering ETW or dump collection, implemented by the consumer.

Roslyn specific implementation

The overarching goal of this work item is to enable automated aggregation of:

Any potential LSIF-generation-blocking failure.
Any exception or warning that may be indicative of a failure.
Performance metrics, such as which part of generation takes the longest.
Diagnostic and verbose log output that may help with understanding major branches in generation logic that would enable one to better triage issues.

Cost Considerations for Logging

My goal is for this logging to be fairly verbose. Anything that might be of value in diagnosing and triaging failures should ideally be written using this logging mechanism, and aggregation would happen in the consumer prior to transmission. For example, it should be OK to invoke log a few dozen or even a hundred times, so long as writing to STDERR is not itself a bottleneck.
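To make the consumer-side aggregation concrete, here is a small Python sketch that reduces many log commands to counts per (severity, code) bucket, using `code` as the bucketing parameter described earlier. The function name `aggregate`, the shape of the summary, and the policy that later telemetry values overwrite earlier ones are all assumptions, not part of the proposal.

```python
from collections import Counter


def aggregate(commands):
    """Summarise log commands as counts per (severity, code) bucket."""
    buckets = Counter()
    telemetry = {}
    for command in commands:
        if command.get("command") != "log":
            continue  # only 'log' is defined so far
        parameters = command.get("parameters") or {}
        buckets[(parameters.get("severity"), parameters.get("code"))] += 1
        # Later telemetry values overwrite earlier ones (assumed policy).
        telemetry.update(command.get("telemetry") or {})
    return {"buckets": dict(buckets), "telemetry": telemetry}
```

Under this scheme a hundred log lines with the same code collapse into one bucket with a count, which keeps the transmitted payload small even when the tool logs verbosely.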

Sample Metrics

I'm looking for 'guard rails' that may indicate something went wrong at this stage. Here are some examples:

Documents count
Projects count
Documents with errors
Issues or diagnostics from discovering and using a particular version of the .NET SDK, MSBuild, a NuGet package, or a tool.
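For illustration, metrics like these could be reported as telemetry on an "Info" log command. The Python sketch below shows the producer side under that assumption. The function name `emit_metrics` and every telemetry key in it are hypothetical and chosen only to resemble the `Roslyn.LsifGenerator.*` keys in the earlier example. The real tool would presumably do this in C#.

```python
import json
import sys


def emit_metrics(documents, projects, documents_with_errors, elapsed_seconds,
                 stream=sys.stderr):
    """Write one log command as a single JSON line and return that line."""
    line = json.dumps({
        "command": "log",
        "parameters": {"severity": "Info"},
        "telemetry": {
            # All key names below are hypothetical.
            "Roslyn.LsifGenerator.DocumentCount": documents,
            "Roslyn.LsifGenerator.ProjectCount": projects,
            "Roslyn.LsifGenerator.DocumentsWithErrors": documents_with_errors,
            "Roslyn.LsifGenerator.ElapsedSeconds": elapsed_seconds,
        },
    })
    # json.dumps escapes embedded newlines, so the command stays on one line.
    stream.write(line + "\n")
    return line
```

Emitting the whole command with a single `write` call keeps each command on its own line, which is what the line-based parsing rules rely on.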

CyrusNajmabadi commented 1 year ago

Can you add more info on what the "Standard error protocol" is?

gundermanc commented 1 year ago

> Can you add more info on what the "Standard error protocol" is?

Done.

FYI @jasonmalinowski

dotnet / roslyn

LSIF generator should implement 'standard error protocol' and report exceptions, health, and performance metrics #69610