dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.25k stars 4.73k forks source link

NativeAOT type instantiation sharing #83318

Closed NinoFloris closed 1 year ago

NinoFloris commented 1 year ago

In similar vein of https://github.com/dotnet/runtime/pull/82591, would it be possible to reduce duplication when all type parameters on a type are not relevant for type layout/size?

As an example

// or struct
class Wrapper<T>
{
    public List<T> Value { get; }
}

As I understand it today this produces a new instantiation per value type for T. I understand .NET isn't (immediately) going to support gc shapes (in the golang generics sense) as a size optimization. However this particular case seems like an analysis that can be done at the template level which can 'significantly' reduce bloat.

Is there any appetite for optimizations like these?

@MichalStrehovsky

dotnet-issue-labeler[bot] commented 1 year ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 1 year ago

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas See info in area-owners.md if you want to be subscribed.

Issue Details
In similar vein of https://github.com/dotnet/runtime/pull/82591, would it be possible to reduce duplication when all type parameters on a type are not relevant for type layout/size? As an example ```cs // or struct class Wrapper { public List Value { get; } } ``` As I understand it today this produces a new instantiation per value type for T. I understand .NET isn't (immediately) going to support gc shapes (in the golang generics sense) as a size optimization. However this particular case seems like an analysis that can be done at the template level which can 'significantly' reduce bloat. Is there any appetite for optimizations like these? @MichalStrehovsky
Author: NinoFloris
Assignees: -
Labels: `untriaged`, `area-NativeAOT-coreclr`
Milestone: -
MichalStrehovsky commented 1 year ago

People rely on this for perf optimizations. E.g. struct that implements an interface and then using that struct with a constrained generic T : IInterface method, expecting to get optimal inlining.

NinoFloris commented 1 year ago

I wouldn't want this to happen for methods, only for the type side. As I understand it this still adds overhead for bookkeeping that is basically entirely duplicate.

MichalStrehovsky commented 1 year ago

If the struct is not boxed, the type pretty much disappears after AOT compilation.

If it is boxed, the identity is observable through is checks or object.GetType. We cannot fold it - it needs a unique identity.

NinoFloris commented 1 year ago

The reason for this issue is the following mstat report in Slon https://github.com/NinoFloris/Slon/actions/runs/4400673130

Slon is essentially a greenfield Npgsql, which I'm using to prototype improvements that flow back into Npgsql. To support postgres features like user defined database types and arrays, ranges etc (composition) these drivers ships with an extensible serializer. As a result we face many of the same challenges as STJ.

The interesting type that my example was modeled on is defined here https://github.com/NinoFloris/Slon/blob/main/Slon/Pg/PgConverterInfo.cs#L263-L280

This type is IMO the definition of such a data carrying type that never actually deals with T directly.

Yet here under "Top 20 Slon Types By Size" it's seen as many times as there are rooted PgConverterResolution's types closed over value type T's. I don't think we're actually boxing this particular type anywhere either.

Screenshot 2023-03-13 at 03 08 01

Additionally under "Top 20 Slon Classes By Methods Size" you can find method sizes for

Screenshot 2023-03-13 at 02 43 33

I'm a bit at a loss why __GetFieldHelper is as big as it is (it also grows per instantiation)

This is basically what I hope can be reduced as this brings a data organization tax where even bundling values under a nice nominal type causes bloat (as opposed to pushing them into a ValueTuple which wouldn't duplicate as PgConverter<T> is a class).

(Effectively what I'm asking for is a level of indirection such that the bulk of the instantiation data can be shared, only an nint for the shared data + nint * type arguments? stub would remain unshared to handle type identity when the type is boxed)

MichalStrehovsky commented 1 year ago

It indeed doesn't look like PgConverterResolution is getting boxed anywhere - the reason why there's still the extra metadata around the type is because it's a generic valuetype instantiated over reference types. Reference type instantiations do get shared but in order to know what the actual T is at runtime, an extra argument to the method needs to be provided. This extra argument is the MethodTable (which is the 112-byte thing you see).

I have a WIP change that should be able to get rid of them. I'll try to finish it tomorrow: https://github.com/MichalStrehovsky/runtime/commit/530b76b6e4a65b85538a5bee56f68d65029a0528

I'm a bit at a loss why __GetFieldHelper is as big as it is (it also grows per instantiation)

The method supports calling Equals/GetHashCode on the struct at runtime. There is a method body per each non-reference instantiation and the size of the individual method bodies is proportional to the number of fields on the type.

NinoFloris commented 1 year ago

That's fantastic @MichalStrehovsky, really cool to see the work.

So TypeDictionaries were already a thing for R2R?

Another relevant question would be whether Equals/GetHashCode should even stick around if the type is never boxed? (if it's never boxed we should be able to analyze whether Equals/GetHashCode is ever called on the T directly right?) Would the current alternative be to override these methods and leave them empty/throw?

Additionally, and I know this comes back to complicated sharing again, would it in theory be possible to analyze the body and determine T is not actually used so all instantiations could share the code?

MichalStrehovsky commented 1 year ago

Would the current alternative be to override these methods and leave them empty/throw?

No, that will be worse for size. The Equals/GetHashCode support is added because the compiler thinks it needs a full datastructure that describes the type. The TypeDictionary is stored in the virtual method table, along with the virtual methods. The fix I have would remove the need for both.

MichalStrehovsky commented 1 year ago

Do you have a sample where this saving would be meaningful? I tried the NativeAOT sample in the slon repo but that one only has one type we get rid of and the saving is not very meaningful (can't justify 140 lines of code in the compiler to maintain).

NinoFloris commented 1 year ago

Ah I see, it's in the following PR https://github.com/NinoFloris/Slon/pull/1. Rooting happens here https://github.com/NinoFloris/Slon/pull/1/files#diff-0b1de98cf568fad82059feddd85fa310dc9f1b528b9dff1ba74eb7e7fe5ae07cR315

We root a lot of types by default to make sure users can actually use these in their aot apps without encountering errors. Given these conversions can be extended or overridden there isn't a static path from say DbDataReader.GetFieldValue<int[]> to the actual construction of PgConverterInfo.Create(new ArrayConverter<int>(new IntConverter(), DataTypeNames.Int.ToArrayName()) that would allows to do this generically.

With ADO.NET being an abstraction we can't easily introduce STJ like api's which would allow consuming libraries or users to root them explicitly.

public static string Serialize<TValue>(TValue value, JsonTypeInfo<TValue> jsonTypeInfo)

So we compromise by manually rooting a set of converters that we feel is a good default for postgres. Most of these converters deal with structs and their various collection and System.Nullable forms. All of these add to PgConverterResolution duplication.

The PR I linked to doesn't even root half of the types Npgsql would ship out of the box with, so I wouldn't take it as the maximum expected savings either, it's more of a guiding measure for me to optimize the bloat we create.

MichalStrehovsky commented 1 year ago

Thanks! So this is a different problem. My fix would only help for those converters that are over reference types PgConverterResolution<string>, etc. The savings from my fix are minuscule.

The reason why these converters are generated is reflection:

https://github.com/NinoFloris/Slon/blob/00d082db53ffc18e97ef44a149bf371122f59f96/Slon/Pg/Converters/ArrayConverter.cs#L282-L286

The analysis in the compiler finds out we're reflecting over a constructor here. ArrayConverter<T> has a constructor that has PgConverterResolution<T> in the signature. The reflection implementation needs a full data structure describing all parameters, so we generate it, so that one can reflection-call the constructor.

The only way to get rid of the PgConverterResolution instances is to do any one of these:

NinoFloris commented 1 year ago

OK yeah the reflection is not strictly necessary at all, it's essentially old/dead code. I've removed it from main. Thanks for taking a look btw, it's appreciated.

Now in the PR after rebasing on the new main I do see quite a difference (so that's really good!) but still there are about 7 matches of PgConverterResolution1`, do those stick around because reference types are not being shared?

What tools do you use to root cause these issues, I'd like to be able to do it myself.

MichalStrehovsky commented 1 year ago

I used https://github.com/dotnet/runtime/tree/main/src/coreclr/tools/aot/DependencyGraphViewer.

Run the tool, click the "Use ETW events" and dotnet publish your app as usual. You'll see two graphs showing up. Double click the second one. Search for PgConverterResolution.*constructed (it's a regex). Then you can double click on the results to explore the dependencies - it's essentially a graph viewer, so it shows you incoming edges in the upper part (things that brought in the thing you're looking at), and outgoing edges (what this thing depends on).

Sometimes it will just tell you the reason something was added is "reflection". Then repeat the process with the first graph. The first graph has the whole program analysis before code generation and can often have more details.

NinoFloris commented 1 year ago

Good to know, it's just too bad I'm on macOS. Ideally I don't have to boot a VM and sync a project just to do it. I see there are some other tools for inspecting dgmls but they'll probably only deal with the 'first graph' as the second needs ETW?

MichalStrehovsky commented 1 year ago

The readme also describes a second approach - add <IlcGenerateDgmlFile>true</IlcGenerateDgmlFile> to the project file. This will generate the two XML files that the tool can also open (the "scan" one is whole program analysis, "codegen" one is the code generation one).

You'll still need a way to run a WinForms app though because the tool is based on WinForms. Wine would probably work on Linux - I don't know if there's Apple equivalent.

There's also https://github.com/dotnet/runtime/tree/main/src/coreclr/tools/aot/WhyDgml but this tool only shows you a path from a specific node name in the XML to the root. It also works. It's command line based.

NinoFloris commented 1 year ago

The readme also describes a second approach

I noticed, does the dgml meaningfully differ in some way from the ETW events as far as you know?

https://github.com/dotnet/runtime/tree/main/src/coreclr/tools/aot/WhyDgml

Yup thanks, trying that one right now!

MichalStrehovsky commented 1 year ago

I noticed, does the dgml meaningfully differ in some way from the ETW events as far as you know?

The DGML only captures the first edge leading to a vertex. I.e. if X is referenced by A and B, the ETW events will show you both A and B, but DGML will only show A. If your goal is to get rid of X, you can still achieve that with DGML. First you get rid of A and then you get rid of B. Often times this makes the job easier because there's less noise.

NinoFloris commented 1 year ago

Thank you for your help so far, I've come quite far in reducing bloat while expanding the types we support.

It's kind of crazy to see the impact on binary size of even the simplest async use when combined with a good number of instantiations over value types. For instance this 'simple method' came up https://github.com/NinoFloris/Slon/blob/bed56bb5b190517073f475f5cc441b51aa46d965/Slon/Pg/PgConverter.cs#L120-L129

Screenshot 2023-03-20 at 05 12 40

With the size for this one method, ReadAsObject and its codegenned constituents, basically taking up 2/3s of the summed instantiations. Add in the overrides that we need to add for PgFixedBinarySizeConverter to do internal buffering and it starts to add up tremendously for converters that shouldn't ever need to do their own IO.

It's very clearly forcing us to create a much deeper inheritance hierarchy (generally a no-no) that segments apis based on the capability to go async. Most converters should derive from a buffered base and only in the exceptional cases do we derive from the other streaming base that has the async apis.

Any converters that may need to do both - like composing converters for arrays etc - as a result need to derive from the most capable base. When they should call into the async apis they then do runtime type checks to prioritize the streaming base. It wouldn't even be safe to call into the sync apis for these converters as they are reused to provide sync IO implementations instead.

If we then want to really be frugal we might want to bifurcate those composing converters as well to something like StreamingArrayConverter<TElement> and BufferedArrayConverter<TElement> but you can see how this starts to take an increasing toll on the implementation complexity.

Let's say it works, but all of it is hardly seamless. Are there any tricks I've missed? Are there plans to improve the situation around async bloat in general?

I have one related question that came up during analysis. I was doing some cross reference of the codegen DGML and an mstat report when I noticed a type Int32Converter<T> with two entries, one of 160 bytes for Int32 (which I assume to be the MethodTable) and another entry of 32 bytes.

I can't figure out what that second entry is for, not from the DGML at least (I do see things like "NecessaryType for constructed type" or "Optional fields" come up, is there some glossary of what these stand for?).

What kind of metadata is that small?

This is the relevant one under "Top 200 Slon Types By Size" which lists the instantiations and smaller bits: https://github.com/NinoFloris/Slon/actions/runs/4464105683

NinoFloris commented 1 year ago

Looking a bit more I'm seeing something else in the following report I'm curious about https://github.com/NinoFloris/Slon/actions/runs/4470515666

Looking at all these types, sure it's only 188 bytes per, but why does each reference type have its own entry besides __Canon? Is it effectively only the methods that are shared?

Screenshot 2023-03-20 at 17 40 13
agocke commented 1 year ago

@NinoFloris I'm not sure how your serializer system works, but you might also be interested in https://github.com/agocke/serde-dn. My aim in the project there isn't to be particularly efficient in minimizing size-per-type, but the goal is to get one "serialize" implementation per type in the vast majority of cases, which would mean that there's lots of sharing across the stack, and you only pull in what you actually use.

NinoFloris commented 1 year ago

@agocke definitely interesting work, I've seen the library in rust, not a bad approximation! I might even prototype some part in our internal codebase to use this instead of the STJ serializer, just to see how it goes.

For the driver level serialization I'm working on it won't be an easy fit though, prime example being that we do quite a bit of streaming reading/writing of variable length column data, which as far as I can see wouldn't be possible.

Additionally we have to deal with the unfortunate reality that db nulls pose (the protocol sends these as a negative column length). We handle those on the outside so not all our value type converters have to be defined over Nullable (users are also expected to call IsDbNull before GetInt on DbDataReader if they expect db nulls).

Some of these could maybe be overcome by different factoring but it all makes for an awkward fit with generalized serializer frameworks...

agocke commented 1 year ago

For the driver level serialization I'm working on it won't be an easy fit though, prime example being that we do quite a bit of streaming reading/writing of variable length column data, which as far as I can see wouldn't be possible.

This is pretty interesting. Do you have an example?

NinoFloris commented 1 year ago

Npgsql has a ring buffer for reading and one for writing. This means that for instance to read strings larger than this buffer size we stream the bytes through an encoder to get chars to build a string with. For writing this amounts to a flush each time we fill up the buffer with utf8 encoded bytes.

If we couldn't do this we would need to read the entire thing into memory at once, encode it in one go and make a string that way, quite a bit less efficient. For writing it would preclude pass through streaming of say json from http requests.

Npgsql also support TextReader and Stream as column readers but in these cases we internalize all the streaming into the stream implementation so that would work without streaming at the serializer level.

Finally as pg has support for composite types like arrays, multiranges, and records (and arrays of these) each of these values could be a large value again. Forcing some form of streaming support throughout the stack.

As you probably know similar support for streaming exists in STJ, it's just not publicly extensible.

agocke commented 1 year ago

Thanks! I'll have to think more about that scenario. I've thought about making an async version of deserialize that would support streaming, but when I've prototyped it I've found that the fixed overhead of async makes the non-streaming versions slower.

MichalStrehovsky commented 1 year ago

The costs of async are known and there has been some thinking whether we could move more of this support into the runtime. There's not much we can do about this.

Otherwise I don't see much actionable here, so I'm going to close this. Sizoscope tool can be used to investigate why something was generated: https://github.com/MichalStrehovsky/sizoscope. If there's something suspicious, feel free to open issue on that thing.