dotnet / efcore

EF Core is a modern object-database mapper for .NET. It supports LINQ queries, change tracking, updates, and schema migrations.
https://docs.microsoft.com/ef/
MIT License

Copy/upsert data across tables in bulk (INSERT ... SELECT) #27320

Open roji opened 2 years ago

roji commented 2 years ago

This issue tracks introducing an API to copy data inside the database, from one table (or several) into a destination table. Note this is different from bulk importing data from the client into the database (#27333).

This was split off from #795 (bulk update/delete).

Basic API

All SQL databases support a variant of the INSERT statement which accepts a query instead of a list of values:

INSERT INTO x SELECT * FROM y

The column list can be specified (INSERT INTO x (a, b) SELECT ...) or omitted (INSERT INTO x SELECT ...). If it's omitted, the subquery must return exactly the columns of the destination table, in order and with matching types. Since it's problematic to rely on table column ordering (e.g. it cannot be changed after the table is created), we should probably force the user to always provide the column list explicitly.

Basic proposals:

// Variant 1: begin from the source table, flow data through the LINQ operators to the destination.
// The naming corresponds to the SQL (`INSERT INTO`), no ambiguity with change-tracking operations on DbSet (like with Update).
ctx.Blogs1.Where(...).InsertInto(ctx.Blogs2, b => b.Name)

// Variant 1.1: Naming-wise, we could make it extra-explicit that it's a bulk operation.
// Also more consistent with BulkDelete/BulkUpdate (with or without `Into` suffix).
// May be slightly ambiguous with bulk import (from client).
ctx.Blogs1.Where(...).BulkInsert(ctx.Blogs2, b => new { b.Name, b.Url })

// Variant 2: we can flip the order, but this adds nesting which seems unnecessary:
ctx.Blogs2.InsertFrom(ctx.Blogs1.Where(...), b => new { b.Name, b.Url })
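
Whichever shape wins, the translation would presumably be a plain INSERT ... SELECT. A sketch of what variant 1 might produce on PostgreSQL (table/column names and the predicate are illustrative, not output of any existing EF Core API):

```sql
-- Hypothetical translation of:
--   ctx.Blogs1.Where(b => b.Rating > 3).InsertInto(ctx.Blogs2, b => new { b.Name, b.Url })
INSERT INTO "Blogs2" ("Name", "Url")
SELECT b."Name", b."Url"
FROM "Blogs1" AS b
WHERE b."Rating" > 3;
```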

Static column compatibility checking

It would be great to statically enforce that the column list matches the incoming columns from the source table, e.g. with the following signature:

public static void InsertInto<TSource, TDestination>(
    this IQueryable<TSource> source,
    DbSet<TDestination> destination,
    Expression<Func<TDestination, TSource>> columnSelector)
    where TDestination : class
{
}

This works great for a single column:

ctx.Blogs.Select(b => b.Name).InsertInto(ctx.Customers, c => c.Foo1);

With multiple columns, this fails if the anonymous type's field names differ:

ctx.Blogs.Select(b => new { b.Name, b.Url }).InsertInto(ctx.Customers, c => new { c.Foo1, c.Foo2 });

Requiring the source projection's and the column selector's anonymous types to have identical field names seems... problematic (we really do want to project across differently-named columns).

If we had value tuples in expression trees (yet again), this would work quite well:

ctx.Blogs.Select(b => (b.Name, b.Url)).InsertInto(ctx.Customers, c => (c.Foo1, c.Foo2));

In any case, if we don't want this to depend on value tuple syntax, we could give up compile-time enforcement with the following signature:

public static void InsertInto<TSource, TDestination, TColumns>(
    this IQueryable<TSource> source,
    DbSet<TDestination> destination,
    Expression<Func<TDestination, TColumns>> columnSelector)
    where TDestination : class
{
}

... and the query would fail at runtime if things are mismatched.

Finally, note that if we want to, we could have a specific overload for copying between tables mapped to shared type entity types - in this case no column list is necessary:

ctx.Blogs1.InsertInto(ctx.Blogs2);

Fancier examples

// With navigation:
ctx.Blogs1
    .Where(b => b.Posts.Any())
    .Select(b => new { Blog = b, FirstPost = b.Posts.OrderBy(p => p.Popularity).First() })
    .Select(x => new { x.Blog.Url, x.Blog.Name, x.FirstPost.Title, x.FirstPost.Author })
    .InsertInto(ctx.Foo, f => new { f.Url, f.Name, f.Title, f.Author });

// Fetching generated columns back (RETURNING/OUTPUT):
var ids = ctx.Customers
    .Select(...)
    .InsertInto(
        ctx.Blogs,
        columnSelector: b => new { b.Name, b.Url },
        returningSelector: b => b.Id)
    .ToList();
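
Fetching generated columns back maps naturally to RETURNING on PostgreSQL/SQLite and to OUTPUT on SQL Server. An illustrative sketch of both flavors (not generated by any shipped API; names are made up):

```sql
-- PostgreSQL/SQLite flavor
INSERT INTO "Blogs" ("Name", "Url")
SELECT c."Name", c."Url" FROM "Customers" AS c
RETURNING "Id";

-- SQL Server flavor
INSERT INTO [Blogs] ([Name], [Url])
OUTPUT INSERTED.[Id]
SELECT c.[Name], c.[Url] FROM [Customers] AS c;
```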

// Insert or ignore:
ctx.Customers.Select(...).InsertIntoOrIgnore(
    ctx.Blogs2,
    insertColumnSelector: b => new { b.Name, b.Url },
    uniquenessColumnSelector: b => b.Name);
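
On PostgreSQL the insert-or-ignore variant could translate along these lines (a sketch with illustrative names; SQLite's ON CONFLICT syntax is very similar):

```sql
-- "Name" is the uniqueness column; conflicting rows are silently skipped
INSERT INTO "Blogs2" ("Name", "Url")
SELECT c."Name", c."Url" FROM "Customers" AS c
ON CONFLICT ("Name") DO NOTHING;
```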

// Insert or update (UPSERT). The update action can access both the existing row and the excluded (which we attempted to insert):
ctx.Customers.Select(...).InsertIntoOrUpdate(
    ctx.Blogs,
    insertColumnSelector: b => new { b.Name, b.Url },
    uniquenessColumnSelector: b => b.Name,
    updateAction: (existing, excluded) => new Blog
    {
        Name = existing.Name + "_updated",
        Url = excluded.Url
    });

// The same UPSERT with generated column fetch:
ctx.Customers.Select(...).InsertIntoOrUpdate(
    ctx.Blogs,
    insertColumnSelector: b => new { b.Name, b.Url },
    uniquenessColumnSelector: b => b.Name,
    updateAction: (existing, excluded) => new Blog
    {
        Name = existing.Name + "_updated",
        Url = excluded.Url
    },
    returningSelector: b => b.Id);
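
For the UPSERT variants, PostgreSQL's ON CONFLICT ... DO UPDATE exposes the incoming row via the special excluded table, which lines up with the (existing, excluded) lambda above. A PostgreSQL sketch of the last example, including the generated-column fetch (names illustrative):

```sql
INSERT INTO "Blogs" ("Name", "Url")
SELECT c."Name", c."Url" FROM "Customers" AS c
ON CONFLICT ("Name") DO UPDATE
    SET "Name" = "Blogs"."Name" || '_updated',  -- existing row
        "Url"  = excluded."Url"                 -- row we attempted to insert
RETURNING "Id";
```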

// With sproc as input on SQL Server (generates `INSERT ... EXECUTE`):
ctx.SprocSet.InsertInto(ctx.DbSetMappedToSproc, b => new { ... });

// With common table expression (depends on #26486):
EF.Functions.With(...).InsertInto(ctx.Blogs, b => new { ... });

Additional notes

Documentation

Community implementations

roji commented 2 years ago

Updated the above with both fetching back generated columns (RETURNING/OUTPUT), and how UPSERT would fit into this API ("bulk upsert").

UPSERT implementations across databases generally accept the following information:

  1. Conflict action - either ignore or update. We should probably have two methods to represent this (InsertIntoOrIgnore, InsertIntoOrUpdate)
  2. Columns that make up the uniqueness check, or match condition (usually expected to correspond to a unique constraint). In some cases (PG, SQLite) the constraint name can be provided directly.
    • MySQL doesn't require or support this - violation of any unique constraint always triggers the update action (it's discouraged to use UPSERT in MySQL on tables with more than one unique constraint). MySQL can simply ignore this parameter (and even introduce an overload without it).
  3. If doing insert or update, the SET part - this is very similar to how bulk update works (UPDATE ... SET ...). Note that in most databases it's possible for SET to refer to both the existing row and the excluded incoming one.
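
Point 2 is where providers diverge most: PostgreSQL/SQLite take an explicit conflict target, while MySQL's ON DUPLICATE KEY UPDATE takes no target at all and fires on any unique-constraint violation. An illustrative MySQL sketch (names made up; VALUES() is the older syntax for the incoming row, newer MySQL versions prefer row aliases):

```sql
-- No conflict target: any unique-constraint violation triggers the update
INSERT INTO Blogs (Name, Url)
SELECT c.Name, c.Url FROM Customers AS c
ON DUPLICATE KEY UPDATE Url = VALUES(Url);
```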

AndriySvyryd commented 2 years ago

Looks good. We'll probably need overloads for types without a DbSet:

ctx.Blogs1.Where(...).InsertInto<Blogs2>(b => b.Name);
ctx.Blogs1.Where(...).InsertInto<Blogs2>("STETBlog", b => b.Name);

borisdj commented 2 years ago

Just to add to the list of community implementations: EFCore.BulkExtensions now has a SourceTable feature with the config CustomSourceTableName, and also optionally supports a CustomSourceDestinationMappingColumns dictionary. Usage example: context.BulkInsert(new List<Item>(), b => b.CustomSourceTableName = "ItemBase");

roji commented 1 year ago

Another use case that may be covered here is having client-provided data as the source, e.g. provide an array of .NET instances and upsert them into the target table in the database. With MERGE this looks like this:

MERGE TheTable t
USING @data d
ON t.Id = d.Id
WHEN MATCHED THEN 
    UPDATE SET [Text] = d.[Text]
WHEN NOT MATCHED THEN 
    INSERT (Id, [Text]) VALUES (d.[Id], d.[Text]); 

The API shape above may not be suitable for this, since it starts with a DbSet as the source, but we want to send client data. An interesting idea is to add an API to create a (temporary) DbSet out of a collection of client-side instances, which we can then e.g. upsert into a table in the database.

Note that for SQL Server, the above requires a custom type, which is a problem (but PG should support this without one, as tables have types created for them implicitly).
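
For reference, the custom type mentioned here is a SQL Server user-defined table type, which must exist before @data can be passed as a table-valued parameter. A sketch (type and column names illustrative, matching the MERGE above):

```sql
CREATE TYPE TheTableType AS TABLE
(
    Id INT NOT NULL PRIMARY KEY,
    [Text] NVARCHAR(MAX) NULL
);
-- The client then sends @data as a parameter of type TheTableType
-- (SqlDbType.Structured in ADO.NET), and the MERGE consumes it.
```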

AraHaan commented 1 year ago

Question: would this proposed API allow inserting items (from any LINQ query) in bulk without tracking them? I have run into such a use case when running my Discord bot on a 2 GB RAM Ubuntu VPS; it ends up OOMing fast (I think the change tracker contributes to the problem, on top of Remora.Discord's caching).

roji commented 1 year ago

@AraHaan no - this is about copying data from one table to another. Bulk import from the client is covered by #27333.

In the meantime, you can drop down to your driver's low-level import mechanism (e.g. SqlBulkCopy for SQL Server).

voltcode commented 1 year ago

Is this planned for ef core 8 ?

ajcvickers commented 1 year ago

@voltcode This issue is in the Backlog milestone. This means that it is not planned for the next release (EF Core 8.0). We will re-assess the backlog following this release and consider this item at that time. However, keep in mind that there are many other high priority features with which it will be competing for resources. Make sure to vote (👍) for this issue if it is important to you.

stevenpua commented 8 months ago

Greetings all! How do we upvote this? This would make a lot of our app database-provider agnostic; we would no longer need to use ExecuteSql once this is ready. How will this work with the non-SQL providers (or will there be no support for those)?

Leaving this comment to upvote this.

roji commented 8 months ago

@stevenpua you upvote issues by upvoting the top-most comment above (the issue description) - posting comments is not something that helps us prioritize, and creates needless notifications and churn.

roji commented 8 months ago

Re non-SQL providers: the API here will likely be abstract and implementable by any provider type, so at least in principle, any database that supports a copy-across-tables mechanism would be covered. However, non-SQL databases typically don't implement such a mechanism.