dotnet / efcore

EF Core is a modern object-database mapper for .NET. It supports LINQ queries, change tracking, updates, and schema migrations.
https://docs.microsoft.com/ef/

Bulk import for efficient importing of data from the client into the database #27333

Open · roji opened this issue 2 years ago

roji commented 2 years ago

Databases usually provide various mechanisms for efficient bulk import of data from the client (e.g. .NET application) into the database. We already have issues tracking improving INSERT performance (#15059, #9118, #10443), but bulk import is a specialized scenario where the native database mechanisms work more efficiently (and often considerably so) than multiple INSERTs. Note also that this is different from bulk copy of data across tables within the database (INSERT ... SELECT, tracked by #27320).

Bulk import only allows targeting a table, so a simple method on DbSet should suffice (LINQ operators don't make sense here):

ctx.Blogs.ImportFrom(blogs);

ImportFrom should accept an IEnumerable parameter; ImportFromAsync should have overloads accepting both IEnumerable and IAsyncEnumerable parameters. The method would pull entity instances and import them into the database table via the provider-specific mechanism.
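A minimal sketch of what that surface could look like (hypothetical - ImportFrom/ImportFromAsync don't exist in EF Core today; the bodies are placeholders and only the shapes matter):

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public static class BulkImportDbSetExtensions
{
    // Pulls entity instances from the sequence and writes them to the mapped
    // table via the provider's native mechanism (SqlBulkCopy, binary COPY, ...).
    public static void ImportFrom<TEntity>(
        this DbSet<TEntity> set, IEnumerable<TEntity> source)
        where TEntity : class
        => throw new NotImplementedException();

    // Async overload for in-memory sequences.
    public static Task ImportFromAsync<TEntity>(
        this DbSet<TEntity> set, IEnumerable<TEntity> source,
        CancellationToken cancellationToken = default)
        where TEntity : class
        => throw new NotImplementedException();

    // Async overload streaming from an IAsyncEnumerable source.
    public static Task ImportFromAsync<TEntity>(
        this DbSet<TEntity> set, IAsyncEnumerable<TEntity> source,
        CancellationToken cancellationToken = default)
        where TEntity : class
        => throw new NotImplementedException();
}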

Additional notes

Database support

Community implementations

AndriySvyryd commented 2 years ago

We might need an overload with anonymous types for shadow properties.

Note that this would also allow inserts for keyless entity types. So we might need a non-collection overload too.
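For illustration, the anonymous-type shape might look something like this (hypothetical; assumes a Post entity with a shadow FK property BlogId that has no CLR member to bind to):

// Members of the anonymous type would be matched to mapped properties by
// name, letting callers supply values for shadow properties like BlogId.
ctx.Posts.ImportFrom(rows.Select(r => new { r.Title, BlogId = r.OwnerBlogId }));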

roji commented 2 years ago

We might need an overload with anonymous types for shadow properties.

Yep..

Note that this would also allow inserts for keyless entity types. So we might need a non-collection overload too.

Can you elaborate?

AndriySvyryd commented 2 years ago

Can you elaborate?

Just some sugar:

ctx.KeylessBlogs.ImportFrom(new KeylessBlog());

roji commented 2 years ago

Ah I see - you're saying this would be the only way to insert keyless entity types (because we don't support them in Add/SaveChanges), so it makes sense to allow "bulk import" of a single instance... Right.

roji commented 2 years ago

Note: we have #9118 for optionally not tracking after SaveChanges, which would unlock using SqlBulkCopy for standard SaveChanges. At that point, the advantage of a dedicated bulk import API (this issue) becomes bypassing the change tracking machinery, which may or may not be worth it.

yzorg commented 2 years ago

Vote for taking a very close look at EFCore.BulkExtensions. One of the most underrated .NET OSS libraries I use.

It already respects EF Core table name mappings and column/property mappings. It supports all the "batch" scenarios: "upsert", insert-only (only new rows), and "delete if missing" (MERGE with a DELETE clause). I've needed all those scenarios. All are way faster than current EF Core for 10k+ rows (i.e. importing a full dataset daily/monthly/etc.). It supports pulling new identity values back into entities, but that's off by default, which I think is sensible for bulk scenarios. And if you can reuse roji's RETURNING work here, it might speed that up.
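For reference, those scenarios map onto the library's extension methods roughly as follows (method and option names from EFCore.BulkExtensions as I recall them; check the library's docs for the current surface):

using EFCore.BulkExtensions;

// Insert only (new rows).
await ctx.BulkInsertAsync(blogs);

// Upsert: insert new rows, update existing ones (MERGE).
await ctx.BulkInsertOrUpdateAsync(blogs);

// Full sync: upsert plus delete-if-missing (MERGE with a DELETE clause).
await ctx.BulkInsertOrUpdateOrDeleteAsync(blogs);

// Pull generated identity values back into the entities (off by default).
await ctx.BulkInsertAsync(blogs, new BulkConfig { SetOutputIdentity = true });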

roji commented 2 years ago

@yzorg thanks, yeah - we're aware of EFCore.BulkExtensions.

The plan specifically in this issue is only to wrap the database's native bulk import mechanism - SqlBulkCopy for SQL Server, binary COPY for PostgreSQL, etc. These lower-level APIs generally allow copying large quantities of data into a table in the most efficient manner possible.

Various "merge" functionality such as upserting, deleting if missing, etc. are higher-level functionality which generally isn't covered by a database's bulk import mechanism, and are implemented by using additional database functionality. For example, it's typical to first use bulk import to efficiently copy data into a temporary table (e.g. via SqlBulkCopy), and then use MERGE to perform upsert/delete-if-missing/whatever between the temp table and the target table.

Such "compound" operations would be out of scope for this specific issue, but it would provide the first building block (bulk import). You could then use raw SQL to express whatever MERGE operation you want (we could possibly even provide an EF API for that, though I'm not sure it would be very valuable).

khteh commented 1 year ago

https://www.npgsql.org/doc/copy.html#binary-copy is rather crude. I need to ingest tens of thousands of records from an Excel sheet into the DB without any duplicates.

roji commented 1 year ago

@khteh in what way is Npgsql's binary copy support crude? Deduplicating isn't a concern of bulk import in itself - that's something that needs to be done at a higher level (and therefore wouldn't be covered here in any case). More details on exactly what you need could help clarify things.
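To illustrate roji's point, dedup can be layered on top of binary copy with the same temp-table pattern described earlier in the thread: COPY into a staging table, then insert with conflict handling. A minimal sketch (invented table/column names and rows variable; the import API itself is the one from the linked Npgsql docs):

using Npgsql;

using var conn = new NpgsqlConnection(connectionString);
conn.Open();

// 1. Create a staging table shaped like the target table.
using (var create = new NpgsqlCommand(
    "CREATE TEMP TABLE staging (LIKE records INCLUDING ALL);", conn))
{
    create.ExecuteNonQuery();
}

// 2. Stream the rows into the staging table via binary COPY.
using (var writer = conn.BeginBinaryImport(
    "COPY staging (id, name) FROM STDIN (FORMAT BINARY)"))
{
    foreach (var row in rows)
    {
        writer.StartRow();
        writer.Write(row.Id);
        writer.Write(row.Name);
    }
    writer.Complete();
}

// 3. Move the staged rows into the real table, skipping duplicates.
using (var insert = new NpgsqlCommand(
    "INSERT INTO records SELECT * FROM staging ON CONFLICT DO NOTHING;", conn))
{
    insert.ExecuteNonQuery();
}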

alrz commented 2 weeks ago

Would love to see this support IAsyncEnumerable as the input, in case we're pulling from another data source like Mongo into SQL.
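For illustration, with the IAsyncEnumerable overload sketched earlier, that scenario could stream end to end without buffering (ImportFromAsync is the hypothetical API proposed in this issue; GetBlogsFromMongoAsync is an invented placeholder):

// Hypothetical: stream documents out of Mongo and bulk import into SQL
// without materializing the whole set in memory.
IAsyncEnumerable<Blog> blogs = GetBlogsFromMongoAsync();
await ctx.Blogs.ImportFromAsync(blogs);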