dotnet / efcore

EF Core is a modern object-database mapper for .NET. It supports LINQ queries, change tracking, updates, and schema migrations.
https://docs.microsoft.com/ef/
MIT License

Future NativeAOT work #34446

Open roji opened 4 weeks ago

roji commented 4 weeks ago

NOTE: this is very incomplete; I'm just starting to chart out future work here.

Substantial work was done in EF 9, see #29754.

Query

Compiled model size/perf

Our current compiled model is very big (#33483), which makes the application large in terms of code size (especially for AOT) and slow to start up (the main cause is probably that all of that code needs to be JITted, but even with AOT the impact should be non-trivial).

Size reduction

Tooling

Possibly out of scope

AndriySvyryd commented 3 weeks ago

> Possibly experiment with non-code-based approaches to the compiled model, e.g. use some serialization format instead, such as MsgPack or similar.

Note that one of the functions of the code-based compiled model is to provide hints for the trimmer for code that shouldn't be removed. This effect might be hard to replicate accurately otherwise.

roji commented 3 weeks ago

@AndriySvyryd let's discuss this in depth when we start concentrating on the EF 10 plan... Moving away from C# in the compiled model is definitely not an easy/light change; I just want us to possibly run an experiment to see where the challenges are, to what extent it fixes our problems in this area, etc.

mateli commented 1 week ago

> Possibly experiment with non-code-based approaches to the compiled model, e.g. use some serialization format instead, such as MsgPack or similar.

I had a similar performance problem with EF. At first I tried reading data directly into DataTable and DataFrame; both were somewhat faster than EF and used significantly less memory, which was my first goal, since no other optimization matters while you are waiting on swap-file operations. I then built a simpler solution that reads directly from the data source, as feeding a DataTable or DataFrame does internally: I read out the column names and data types, create a List<T> for each column, and add some code to treat the lists as columns and rows, similar to DataTable and DataFrame. This very simple approach turned out to be much faster, which I attribute to List<T> being a rather thin wrapper over arrays. I tried both creating plain lists and pre-allocating capacity for the expected row count, but pre-allocation did not yield a significant advantage; List<T> grows by copying its data into a larger array, but in my case that was not a noticeable bottleneck.
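
For illustration, here is a minimal sketch of that column-wise read using plain ADO.NET (the SELECT statement, table name, and column types are made up for the example):

```csharp
using System.Collections.Generic;
using System.Data.Common;

public static class ColumnReader
{
    // Read a result set column-wise: one List<T> per column instead of one object per row.
    // Works against any ADO.NET provider via the DbConnection abstraction.
    public static (List<int> Ids, List<float> Values) ReadColumns(DbConnection connection)
    {
        var ids = new List<int>();
        var values = new List<float>();

        using DbCommand command = connection.CreateCommand();
        command.CommandText = "SELECT Id, Value FROM Measurements"; // hypothetical table

        using DbDataReader reader = command.ExecuteReader();
        while (reader.Read())
        {
            // Append each field to its column list; List<T> is a thin wrapper over a
            // contiguous array, so each column stays contiguous in memory.
            ids.Add(reader.GetInt32(0));
            values.Add(reader.GetFloat(1));
        }

        return (ids, values);
    }
}
```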

In this use case many columns needed to be converted, but I found that doing the conversion as a second step was faster than doing it while reading from the database, especially when I could use System.Numerics for SIMD-based conversion, for example widening float to double, which is much faster than casting values one at a time. These were basic experiments, and converting after the read may not be optimal for all data types. I think one key factor in why this was fast for me is that a List<T> backed by an array keeps each column contiguous in memory, which is usually good for CPU cache utilization; System.Numerics also performs fewer memory operations, since data moves to and from registers in chunks.
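
As an illustration of that kind of SIMD-based widening pass, a small sketch using System.Numerics.Vector (this is just the conversion step, not EF code):

```csharp
using System.Numerics;

public static class ColumnConverter
{
    // Convert a float column to double in SIMD-sized chunks, as a separate pass after reading.
    public static double[] WidenToDouble(float[] source)
    {
        var result = new double[source.Length];
        int i = 0;
        int lastBlock = source.Length - source.Length % Vector<float>.Count;

        for (; i < lastBlock; i += Vector<float>.Count)
        {
            // Widen one vector of floats into two vectors of doubles.
            var chunk = new Vector<float>(source, i);
            Vector.Widen(chunk, out Vector<double> low, out Vector<double> high);
            low.CopyTo(result, i);
            high.CopyTo(result, i + Vector<double>.Count);
        }

        // Scalar tail for the remaining elements.
        for (; i < source.Length; i++)
            result[i] = source[i];

        return result;
    }
}
```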

Another reason I used lists is that I was storing data to disk in the Parquet format for processing in other languages, and I found that dumping arrays straight into the Parquet.Net library was significantly faster than persisting a DataTable through the library's built-in support for that. This is because DataTable has no efficient way to convert columns to arrays, whereas my List-based columns were already backed by arrays. Any storage format that lays out tables column-wise, with column data written in blocks, is significantly faster when it can read directly from an array; in some cases such writes can even be handled by DMA with minimal CPU involvement. The same goes for streaming data over any channel with DMA support, such as most modern network adapters and even many serial ports used in embedded systems.
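
For reference, a rough sketch of what dumping the column arrays might look like with Parquet.Net's column-based API (this assumes the current async API; exact type and method names may differ between library versions, and the field names are made up):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using Parquet;         // Parquet.Net NuGet package
using Parquet.Data;
using Parquet.Schema;

public static class ParquetDump
{
    // Write per-column arrays straight into a Parquet file, with no DataTable in between.
    public static async Task WriteColumnsAsync(string path, List<int> ids, List<float> values)
    {
        var idField = new DataField<int>("Id");
        var valueField = new DataField<float>("Value");
        var schema = new ParquetSchema(idField, valueField);

        using Stream stream = File.Create(path);
        using var writer = await ParquetWriter.CreateAsync(schema, stream);
        using var rowGroup = writer.CreateRowGroup();

        // List<T>.ToArray() copies the backing array; the writer consumes plain arrays.
        await rowGroup.WriteColumnAsync(new DataColumn(idField, ids.ToArray()));
        await rowGroup.WriteColumnAsync(new DataColumn(valueField, values.ToArray()));
    }
}
```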

Finally, I hand-wrote wrappers for both this storage backend and my EF backend, based on common interfaces, so that I could use them interchangeably in an EF-like way: reading a row produces an object that, after its values are changed, can be persisted through EF. In fact I wrote three wrappers, for my solution, for DataTable, and for EF, so that I could read the data all three ways and compare the results as a way of testing all of this.

On a memory-constrained system the convert-after-read approach may be suboptimal, and it may be better not to keep intermediate tables holding the raw data. But since this approach moved my system from being memory-constrained to being CPU-constrained, it was faster for me.

A similar approach could be used inside EF. Instead of creating an object for every database row, a more compact store could be used: a simple layer that does nothing but hold data efficiently, with no change tracking or anything else. It may even be more efficient to make this store immutable rather than work out how to handle changes in the data; changes could be put into another, similar store. EF could then sit on top of that and only create objects when needed. For many queries, no row-wrapping objects would need to be created, and fetching a table into a collection would not have to take ten times more memory than storing the columns in arrays.
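
To make the idea concrete, a rough sketch of that separation (all names are hypothetical, and this is not how EF's materializer actually works today): a compact, immutable columnar store that only creates entity instances when a caller asks for one.

```csharp
using System.Collections.Generic;

// Hypothetical entity type, used only for illustration.
public sealed class Measurement
{
    public int Id { get; set; }
    public double Value { get; set; }
}

// A compact, immutable columnar store; entities are materialized only on demand.
public sealed class MeasurementColumns
{
    private readonly int[] _ids;
    private readonly double[] _values;

    public MeasurementColumns(int[] ids, double[] values)
    {
        _ids = ids;
        _values = values;
    }

    public int Count => _ids.Length;

    // Column-wise access: no per-row objects are allocated.
    public double ValueAt(int row) => _values[row];

    // Row-wise access: create an object only when a caller actually needs one,
    // e.g. to hand it to EF for change tracking and persistence.
    public Measurement Materialize(int row) => new Measurement
    {
        Id = _ids[row],
        Value = _values[row]
    };

    public IEnumerable<Measurement> MaterializeAll()
    {
        for (int row = 0; row < Count; row++)
            yield return Materialize(row);
    }
}
```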

This is mostly speculation at this point, but I do believe there are large gains to be had by separating the actual in-memory storage from the object model. Furthermore, an extensive, optimized array-conversion library that leans heavily on System.Numerics for numeric values could probably increase performance even more.

On memory-constrained systems a serialized format could use much less memory, but on other systems it could prevent high-performance techniques like applying System.Numerics to the data. For data like strings, however, it is probably more efficient to store the column in a compact format, since bulk string operations are usually memory-bound: if a string can be read with minimal memory access and decompressed into the CPU cache, that may be significantly faster than reading raw strings. Also, a string column may sometimes hold foreign keys, in which case a few values are repeated over and over; such columns can benefit a lot from being converted to an array of integers plus a dictionary mapping each integer to its string value.
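
For the repeated-string case, a minimal dictionary-encoding sketch (standalone, not tied to any EF API):

```csharp
using System.Collections.Generic;

public static class StringDictionaryEncoding
{
    // Dictionary-encode a string column: repeated values (e.g. string foreign keys)
    // are stored once, and the column itself becomes an array of integer codes.
    public static (int[] Codes, List<string> Dictionary) Encode(IReadOnlyList<string> column)
    {
        var codes = new int[column.Count];
        var dictionary = new List<string>();
        var lookup = new Dictionary<string, int>();

        for (int i = 0; i < column.Count; i++)
        {
            string value = column[i];
            if (!lookup.TryGetValue(value, out int code))
            {
                // First time we see this value: assign it the next code.
                code = dictionary.Count;
                dictionary.Add(value);
                lookup.Add(value, code);
            }
            codes[i] = code;
        }

        // Decoding a row is a plain array lookup: dictionary[codes[row]].
        return (codes, dictionary);
    }
}
```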