dotnet / efcore

EF Core is a modern object-database mapper for .NET. It supports LINQ queries, change tracking, updates, and schema migrations.
https://docs.microsoft.com/ef/
MIT License
13.78k stars 3.19k forks source link

Basic support for creating entities on partitioned tables in SQL Server (Code-First) #33505

Open airbreather opened 7 months ago

airbreather commented 7 months ago

Background

SQL Server supports something called partitioned tables. In one of our ORM-free legacy applications, we've used this functionality to great effect to accelerate managing large data sets with minimal logging, after our first attempt at these kinds of tools backfired when the transaction log file tried to outgrow the disk it was on.

What problem are you trying to solve?

Trying to replicate our solution in another application, which uses EF Core (Code-First style), has been a challenge. With many other SQL Server features, even if EF Core doesn't support them directly, migrationBuilder.Sql("..."); is good enough...

...but not for partitioned tables. I can't just create a table (or index) as normal and then issue a command that says "take this existing table (or index) and put it onto that partition scheme". The table (or index) needs to be created on that partition scheme, so the entire CREATE TABLE or (CREATE INDEX) would need to go into migrationBuilder.Sql("...");.

My first thought was to do CREATE PARTITION FUNCTION / CREATE PARTITION SCHEME in migrationBuilder.Sql("..."); as normal, and then inherit from SqlServerMigrationsSqlGenerator and override Generate(CreateTableOperation, ...) to tweak the CREATE TABLE statement to add ON ThePartitionScheme(ThePartitioningColumn) to the right spot (given the right custom annotations).

This was incomplete: RelationalDatabaseCreator.EnsureCreatedAsync tries to run those very migration operations* when the database is created, and so I also needed my Generate(CreateTableOperation, ...) override to be able to start by creating the partition function and scheme if those do not exist.

Ultimately, I have something that I think is going to be functional for a while, but I'm worried how much custom code I had to write in order to do this, including a suppression or two for EF1001.

Describe the solution you'd like

I don't think my use case needs particularly rich support in order to see a huge improvement. I think that I would be happy if I could write something like this in my DbContext:

protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    // include the full SQL to create the partition function and scheme in their initial states.
    modelBuilder
        .HasPartitionFunction(createSql: "CREATE PARTITION FUNCTION PF_Blog /* ... */")
        .HasPartitionScheme(createSql: "CREATE PARTITION SCHEME PS_Blog /* ... */");

    // entity types themselves should have an easier time.
    modelBuilder.Entity<Blog>()
        .HasPartitioning(partitionScheme: "PS_Blog", partitionColumn: "BlogId");
}

I'm going to keep this initial post short and add more context, thoughts, and details into a comment.

airbreather commented 7 months ago

Describe the solution you'd like (cont'd)

I don't know exactly which part of the system should pick up the HasPartitionFunction / HasPartitionScheme bits. Maybe there's a better place, but my instinct is to just tack on all the HasPartitionFunction commands, followed by all the HasPartitionScheme commands, onto the end of Generate(SqlServerCreateDatabaseOperation, ...).

I'm going to be less precise when talking about how the .HasPartitioning part should affect things because I've only looked into it as far as what I must do in order to slam the most basic possible thing into our application kicking and screaming:

  1. When creating a new table annotated with whatever HasPartitioning turns into, it's as simple as can be: in this exact gap, throw in ON ThePartitionScheme(ThePartitionColumn) and I think you're just done.

  2. When creating a new index annotated that way (as opposed to the whole table), I'm sure there's an equivalent spot that's just as easy.

  3. When changing an entity's partition scheme or partition column during a migration (whether that's adding partitioning where there was none before, removing partitioning where it was there before, or changing the partitioning scheme or column from what it was), I think it has to be done like:

    a. Create the new version of the table (with an auto-generated name) structured exactly how it will need to be, including the final partition scheme and partition column (or lack thereof), all constraints, all indexes, the whole thing. b. Copy all the data from the old version of the table into the new version of the table (yuck) c. Drop the old version of the table d. sp_rename the new version of the table to match the name of the old version of the table that was just dropped.

  4. When changing an index's partition scheme or partition column during a migration as with the table, I think it's a lot easier just drop the index and recreate it on that partition scheme.

Taking a look at that, I think I would be almost as happy if only partitioned indexes are supported with this. I'd have to make sure to hit all indexes, including the ones created by ForeignKeyIndexConvention, but that's a lot easier to swallow than what I'm doing now.

Nice-to-have future work

Even once the tables are created, for these specific entities, EF Core basically will only help us with queries. All the manipulation of these data sets is going to have to go through hand-rolled SQL and SqlBulkCopy. I would be blown away if some of these patterns were to get first-class support in EF Core, but I'm not exactly going to hold my breath.

To insert a new self-contained bundle of rows

Prerequisites:

  1. You need a "staging" table that looks otherwise identical to the partitioned table, except it's not partitioned.

    • I've done this by having a base class with all the stuff and then empty Partitioned and Staging subclasses that differ only in whether or not they are partitioned.
  2. You need "ALTER TABLE" permissions on both tables.

Assumptions:

  1. either use a transaction around the whole thing, or coordinate in some other way to make sure that only one process is doing this at a time.

    • If this is overly restrictive, I suppose that you could create as many "staging" tables as concurrent tasks that you need to be able to run in parallel and then tweak this appropriately.
  2. "staging" table is kept completely empty at all times when it's not being used for exactly these bulk operations

  3. database recovery model is either SIMPLE or BULK_LOGGED

Steps:

  1. Get the new ID for the bundle (perhaps by inserting an entity into a related table and fetching its identity value)
  2. Make liberal use of ALTER PARTITION FUNCTION SPLIT RANGE to make sure that the new bundle's data set gets its own dedicated partition.
  3. Using SqlBulkCopy with the appropriate flags, insert the new rows into the "staging" table
  4. Add a constraint to confirm that all the rows in the "staging" table have the correct ID in their partitioning column
  5. ALTER TABLE SWITCH the data from the "staging" table into the appropriate partition of the partitioned table
  6. Drop the constraint that we added before

To delete a self-contained bundle of rows

Prerequisite and assumptions are the same as inserting the new bundle.

Steps (before SQL Server 2016):

  1. ALTER TABLE SWITCH the data from the appropriate partition of the partitioned table into the "staging" table.
  2. TRUNCATE the "staging" table.

Steps (SQL Server 2016+) (I have not tested this variant):

  1. TRUNCATE TABLE WITH PARTITIONS
roji commented 7 months ago

Thanks @airbreather.

Without diving too deep, this is probably one of those DDL features where there would be relatively less value in 1st-class EF support (and where that support would be a lot of work); there are a lot of various table configuration features across databases, where users can simply use custom SQL directly in their migrations to achieve what they want.

For example, specifically on this:

modelBuilder
    .HasPartitionFunction(createSql: "CREATE PARTITION FUNCTION PF_Blog /* ... */")
    .HasPartitionScheme(createSql: "CREATE PARTITION SCHEME PS_Blog /* ... */");

Since the function and scheme are defined as SQL in any case, HasPartitionFunction/Scheme really provide little advantage over simply introducing the appropriate CREATE PARTITION directly into a migration. Modeling features in EF can sometimes be useful where users can change some aspect of the object, and EF knows how to alter that aspect in the database, but here we're really dealing with a function/scheme definition which is just a SQL. So the value here seems very limited.

Allowing specifying partitioning on CREATE TABLE has a bit more value, since it allows users to continue using regular EF modeling for table creation, rather than the alternative of replacing EF's CreateTable() in migrations with SQL that includes the partition (though that's not a terrible workaround either for those tables which are partitioned).

I'll also just note that this is the first time IIRC that we've received a request to support partitioning on SQL Server (note similar request for table partitioning on the PostgreSQL side which has received lots of votes and interest over the years: https://github.com/npgsql/efcore.pg/issues/1035).

airbreather commented 7 months ago

Since the function and scheme are defined as SQL in any case, HasPartitionFunction/Scheme really provide little advantage over simply introducing the appropriate CREATE PARTITION directly into a migration.

If not for the wrinkle about RelationalDatabaseCreator.EnsureCreatedAsync running the SQL generated by Generate(CreateTableOperation, ...), I would completely agree.

As it stands, from my perspective (I've got to run right now, so I can't elaborate), if the "regular EF modeling for table creation" part were implemented without this, then it looks like there's no way to use RelationalDatabaseCreator.EnsureCreatedAsync in an application that also wants to use partitioned tables?

roji commented 7 months ago

Yeah, EnsureCreatedAsync is indeed incompatible with custom SQL, and that's (only) one reason why it's generally unsuitable for real applications. It really is a feature for test code and demos, nothing more - so yeah, I don't think there's a way to use EnsureCreatedAsync and have partitioned tables (or any other DDL/migration feature that EF doesn't handle).

airbreather commented 7 months ago

That makes sense. For context, we use it to initially create a LocalDB database for automated tests that we immediately wipe out with a RESTORE DATABASE command since our data seeding code is a bit slow to do the whole thing on every single CI build. MigrateAsync comes after that, since we don't want to have to recreate that baseline database backup file for every single migration no matter how small...

I can just switch that EnsureCreatedAsync call to another MigrateAsync, though, if that's the recommended alternative to adding a spot for running arbitrary SQL between when the database is created and when the tables are created.

roji commented 7 months ago

Yep, test database management is indeed a whole domain that can be pretty complex :)

So yeah, I'd definitely recommend switching to MigrateAsync for your test database creation; assuming you do use migrations for your production app, this also guarantees that your tests run on the same database schema that your actual app does (notably around questions of custom SQL). The main possible disadvantage of that is that if you start to have a big migration history, that history gets replayed when creating the test database; a possible mitigation is to squash your migration history every now and then and start over. Though if you're creating your database only once when starting a CI run (which is generally good), that hopefully isn't a big deal.

Data seeding specifically is something that should generally be handled separately (especially if lots of data is involved), and so should be orthogonal to how you create your database or schema (i.e. EnsureCreated or Migrate).

yedajiang44 commented 5 months ago

When I wanted to use efcore and pgsql to implement a partition table, I read a lot of information and finally found that there was no support (https://github.com/npgsql/efcore.pg/issues/1035).

I am a little overwhelmed now. I hope efcore can support it.