Improve testing related to tracer / dependency tracking

jtschuster commented 1 year ago

As we create more tools for investigating size after trimming to answer "why is this kept/not kept," it's crucial to improve tracing / dependency tracking. Right now, we only have the most basic testing and there are lots of methods in MarkStep that start with if (Annotations.IsMarked(thing)) return; without tracking the dependency passed in. This early return helps performance of linking but will only mark with a single reason, and will lead people to believe that if they remove the first dependency, they'll be able to remove thing, even if it's not true.

My first thought is to replace KeptAttribute with KeptByAttribute(string MetadataProvider, DependencyKind) to indicate an item should be marked by a specific type/method and a specific reason. I've found a couple places where an item was marked as expected, but due to unrelated/buggy logic, and hopefully this would help catch it.

marek-safar commented 1 year ago

I did a version of this in #1228

jtschuster commented 1 year ago

@agocke and I spoke a little about this and debated the benefits of recording every single dependency that marks a node vs. only recording the first dependency.

Marking first dependency:

Pros:
- Avoids cycles in the dependency graph
- Creates a minimum spanning tree of the dependencies that can be easier to reason about
Cons:
- Doesn't give a full picture of all the nodes that are marking a node (you could remove node S creating the dependency that keeps node T, re-link, and find another dependency keeping T)
- Makes it harder to test that a certain dependency is keeping a node (requires the test to be constructed so that there's only the one dependency keeping it)

Marking all dependencies:

Pros:
- Gives a more complete picture
- Makes it easier to test that a certain dependency is keeping a node
Cons;
- Can be very cluttered with lots of noise
- Cycles can make it harder to traverse up to find the "top level node" that is causing a node to be marked

I'm still in favor of marking all dependencies, and I think we could add a little extra information to be able to construct a tree from the graph and avoid cycles, like the order the linker came across the dependencies. We also could put the burden on the visualizers to come up with a solution for investigating a graph with cycles.

vitek-karas commented 1 year ago

I don't think linker should be the one to reason about what dependency is important - it doesn't know... and doesn't care. The tool should simply dump all dependencies it knows about. The consumers of the output should then apply whatever logic makes sense to turn it into usable data shape.

We need the system to be:

Almost nonexistent if tracing is turned off (vast majority of linker runs will not have tracing turned on, so no perf cost)
Reasonably efficient if it's on, but it should not prefer speed over data completeness

If later on we find that it makes sense to look at the data with the "first dependency only" view we can create a helper library to process the dumped data into that shape. (If we have more than one user of the data, it would make sense to have such library anyway, to deal with parsing the file format and mapping the strings back onto some kind of object model - maybe even type system eventually).

agocke commented 1 year ago

That makes sense to me as a design goal, but if the system currently produces information that's good enough I'm not particularly inclined to improve it until we have a specific request to do so.

vitek-karas commented 1 year ago

I didn't mean to imply that we should do anything about it. Just a guiding principle. I agree that if it currently works, no need to change anything.

MichalStrehovsky commented 1 year ago

Almost nonexistent if tracing is turned off (vast majority of linker runs will not have tracing turned on, so no perf cost)

This is how tracing in ILCompiler.DependencyAnalysisFramework is implemented e.g. here's the first edge logging strategy: https://github.com/dotnet/runtime/blob/82aa87f9cdf9ac20f0f685016648c36247a76ff1/src/coreclr/tools/aot/ILCompiler.DependencyAnalysisFramework/FirstMarkLogStrategy.cs and here's a no-op logging strategy: https://github.com/dotnet/runtime/blob/82aa87f9cdf9ac20f0f685016648c36247a76ff1/src/coreclr/tools/aot/ILCompiler.DependencyAnalysisFramework/NoLogStrategy.cs

Notice these are structs - we then instantiate a generic type over these structs. Because of how generic specialization works, this means that all of this is 100% inlineable. The no-op logging strategy has zero impact (not even an if check anywhere but at the beginning, when we're creating the graph/analyzer).

dotnet / linker

Improve testing related to tracer / dependency tracking #3078