Open MikaelEliasson opened 9 years ago
Batch inserts are multiple insert statements packed together in one DbCommand. Are you referring to that? Or are you referring to: INSERT INTO .. (..) VALUES (..); SELECT SCOPE_IDENTITY(); ? In other words: return the values of interest after the insert to the caller in the same command? IMHO it already does this (otherwise graph saves would not be possible).
What it allows you to do is this: if you are going to insert 2 BlogPosts with 2 Comments each, you can now do that in two SQL statements.
In EF7, the code below generates the SQL that follows it:
var post = new BlogPost
{
    Title = "Hej1",
    Body = "Hej1 Body",
    Comments = new List<Comment>()
    {
        new Comment{ Author = "Author1", Body = "Comment1"},
        new Comment{ Author = "Author2", Body = "Comment2"},
    },
};
db.Posts.Add(post);
foreach (var item in post.Comments)
{
    item.Post = post;
}
db.Comments.AddRange(post.Comments);

var post2 = new BlogPost
{
    Title = "Hej2",
    Body = "Hej2 Body",
    Comments = new List<Comment>()
    {
        new Comment{ Author = "Author3", Body = "Comment3"},
        new Comment{ Author = "Author4", Body = "Comment4"},
    },
};
db.Posts.Add(post2);
foreach (var item in post2.Comments)
{
    item.Post = post2;
}
db.Comments.AddRange(post2.Comments);

db.SaveChanges();
exec sp_executesql N'SET NOCOUNT OFF;
INSERT INTO [BlogPost] ([Body], [Title])
OUTPUT INSERTED.[Id]
VALUES (@p0, @p1),
(@p2, @p3);
',N'@p0 nvarchar(9),@p1 nvarchar(4),@p2 nvarchar(9),@p3 nvarchar(4)',@p0=N'Hej1 Body',@p1=N'Hej1',@p2=N'Hej2 Body',@p3=N'Hej2'
exec sp_executesql N'SET NOCOUNT OFF;
INSERT INTO [Comment] ([Author], [BlogPostId], [Body])
OUTPUT INSERTED.[Id]
VALUES (@p4, @p5, @p6),
(@p7, @p8, @p9),
(@p10, @p11, @p12),
(@p13, @p14, @p15);
',N'@p4 nvarchar(7),@p5 int,@p6 nvarchar(8),@p7 nvarchar(7),@p8 int,@p9 nvarchar(8),@p10 nvarchar(7),@p11 int,@p12 nvarchar(8),@p13 nvarchar(7),@p14 int,@p15 nvarchar(8)',@p4=N'Author1',@p5=1,@p6=N'Comment1',@p7=N'Author2',@p8=1,@p9=N'Comment2',@p10=N'Author3',@p11=2,@p12=N'Comment3',@p13=N'Author4',@p14=2,@p15=N'Comment4'
By using OUTPUT you could pack, for example, 50 inserts into one statement and still get the ids properly populated. It returns tabular data instead of the scalar you get with SCOPE_IDENTITY().
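For illustration, here is roughly what reading that tabular result looks like at the ADO.NET level. This is a hand-written sketch, not EF7's actual code; the helper name and parameter shape are made up, and it assumes the OUTPUT rows come back in VALUES order (SQL Server does not strictly guarantee that, so a real implementation needs a correlation strategy).

// Minimal ADO.NET sketch (not EF7 internals): one multi-row INSERT with
// OUTPUT INSERTED.[Id], reading the generated ids back as a result set
// instead of a SCOPE_IDENTITY() scalar per row.
using System.Collections.Generic;
using System.Data.SqlClient;

static class OutputInsertSketch
{
    public static List<int> InsertPostsReturningIds(
        SqlConnection conn, IList<(string Body, string Title)> posts)
    {
        var cmd = conn.CreateCommand();
        var rows = new List<string>();
        for (int i = 0; i < posts.Count; i++)
        {
            rows.Add($"(@b{i}, @t{i})");
            cmd.Parameters.AddWithValue("@b" + i, posts[i].Body);
            cmd.Parameters.AddWithValue("@t" + i, posts[i].Title);
        }
        cmd.CommandText =
            "INSERT INTO [BlogPost] ([Body], [Title]) OUTPUT INSERTED.[Id] VALUES "
            + string.Join(", ", rows) + ";";

        var ids = new List<int>();
        using (var reader = cmd.ExecuteReader())   // tabular result: one row per insert
            while (reader.Read())
                ids.Add(reader.GetInt32(0));
        return ids;   // assumes ids arrive in VALUES order (see caveat above)
    }
}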
Ah! I didn't know that option was available in T-SQL! It can indeed be used if there's no self-referencing relationship (i.e. none of the inserted rows is referenced by another row in the same insert). I'm also not sure whether other DBs have similar mechanics available, but it's worth checking out.
It might clash though with per-row features added later on (auditing/authorization/validation), but we'll see. :)
Even self-referencing relationships should be solvable; whether it's worth it is another question though.
Building a dependency graph should solve it in the generic case. It's a bit of work but should be very doable. If we assume the most common use case, you will only have 1-M self-referencing relationships (a hierarchy), and the items are Add()ed in the order they "should" be:
Parent { Children : [Child1 { Children : [Child1.1, Child1.2] }, Child2] }
This is easily divided by using a series of collections (a concrete sketch follows this list):
1: Add all the level-0 items to Batch1.
2: Add all the children of items in Batch1 to Batch2.
3 to (N-1): Repeat until empty.
N: Insert Batch1, propagate the returned ids to the FKs in Batch2, and repeat until all batches are done.
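To make those steps concrete, here is a small self-contained C# sketch of the level-by-level batching. Node and InsertBatchReturningIds are hypothetical stand-ins (not EF7 API); the real insert would issue one multi-row INSERT ... OUTPUT per level, as above.

using System.Collections.Generic;
using System.Linq;

// Hypothetical node type for the sketch.
class Node
{
    public int Id;
    public int? ParentId;
    public List<Node> Children = new List<Node>();
}

static class HierarchyBatcher
{
    // Placeholder for one multi-row INSERT ... OUTPUT INSERTED.[Id];
    // it fakes sequential ids here so the sketch is self-contained.
    static int nextId = 0;
    static List<int> InsertBatchReturningIds(List<Node> batch)
        => batch.Select(_ => ++nextId).ToList();

    public static void InsertHierarchy(List<Node> roots)
    {
        var batch = roots;                        // step 1: level-0 items
        while (batch.Count > 0)
        {
            var ids = InsertBatchReturningIds(batch);
            var next = new List<Node>();
            for (int i = 0; i < batch.Count; i++)
            {
                batch[i].Id = ids[i];
                foreach (var child in batch[i].Children)
                {
                    child.ParentId = ids[i];      // step N: propagate PK into child's FK
                    next.Add(child);
                }
            }
            batch = next;                         // steps 2..N-1: descend one level
        }
    }
}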
N-M relationships are not a problem, as all the items can be inserted at once and the mapping table rows get inserted afterwards.
1-1 relationships: Not even sure if L2S supports them; I don't think I ever used one. I think it can be done, but if you insert both entities it would require inserting and then updating as far as I can see (should be the same problem with a single row).
A more generic algorithm is needed if the user decides to insert like this:
Parent{ P1 }
Child{ C1, Parent: P1 }
Child{ C2, Parent: P1 }
Add(P1)
Add(C1)
Add(C2)
I have to think about whether I can find a simple and fast algorithm for that. I have a few ideas but need to test/google a bit.
L2S supports 1:1 relationships just fine; it doesn't support m:n relationships :)
Sorting the dependencies isn't hard; it's done by building a DAG and applying a topological sort. It's not doable to insert dependent rows within the same batch: you have to insert them separately (first the PK side, then the FK side with the PK value obtained from the PK-side insert).
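For reference, a minimal sketch of that topological sort (Kahn's algorithm; the generic signature is illustrative, not L2S or EF7 code). An edge points from the PK side to its dependent FK side, so PK-only rows come out first.

using System;
using System.Collections.Generic;
using System.Linq;

static class InsertOrdering
{
    // dependentsOf(n) yields the nodes whose FK points at n's PK.
    public static List<T> TopoSort<T>(ICollection<T> nodes, Func<T, IEnumerable<T>> dependentsOf)
    {
        var inDegree = nodes.ToDictionary(n => n, _ => 0);
        foreach (var n in nodes)
            foreach (var d in dependentsOf(n))
                inDegree[d]++;                     // one incoming dependency edge

        var ready = new Queue<T>(nodes.Where(n => inDegree[n] == 0));
        var order = new List<T>();
        while (ready.Count > 0)
        {
            var n = ready.Dequeue();
            order.Add(n);                          // all of n's dependencies are placed
            foreach (var d in dependentsOf(n))
                if (--inDegree[d] == 0)
                    ready.Enqueue(d);
        }
        if (order.Count != nodes.Count)
            throw new InvalidOperationException("Cycle in the dependency graph.");
        return order;
    }
}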
One main pain point with batch inserts is that if a row fails (e.g. due to a UC violation or FK violation), the whole batch fails but you don't know which row failed. For me this has always been a showstopper, as I don't see how it can be beneficial to the user to have a persist fail with a generic 'something went wrong, dunno what, figure it out!' kind of error ;)
One cool thing EF7 will do is use OUTPUT in SQL Server to support batch inserts even with db-generated ids. I have not investigated whether any other engines support it or have something similar.
Compared to SqlBulkCopy it's of course slower, but the advantage is that it supports FK fixup automatically and gives better error messages, and it might even be faster under a certain threshold. It already needs code to decide in which order a graph should be inserted, so it should be a quite easy implementation.