dotnet / efcore

EF Core is a modern object-database mapper for .NET. It supports LINQ queries, change tracking, updates, and schema migrations.
https://docs.microsoft.com/ef/
MIT License
13.63k stars 3.15k forks source link

Helps on how to create an empty PostgreSQL database quickly #26541

Closed JonPSmith closed 1 year ago

JonPSmith commented 2 years ago

I have added PostgreSQL helpers to my EfCore.TestSupport. These create unique database names for each xUnit test class, options with logging etc.

I have found that creating an empty database using EnsureDeleted, then EnsureCreated takes ~12 seconds on my setup ( EnsureDeleted = 4 seconds, EnsureCreated = 8 second), which I think is a bit slow (I want fast tests!).

I have therefore created a method that uses Jimmy Bogart's Respawn library, and that takes 2.2 seconds if there is a database already there.

I also tried the drop all tables approach. The PostgreSQL commands run quickly from pgAdmin but it actually takes longer than the Respawn approach (something like 4 seconds) because EnsureCreated has to add the tables.

My questions are:

NOTE: you can see the code for the fast / respawn approach here.

Include provider and version information

NOTE: My PostgreSQL 12 server is running in a Windows Subsystem for Linux (WSL2). Its a fast PC.

EF Core version: 6.0.0-rc.2 Database provider: Npgsql.EntityFrameworkCore.PostgreSQL" Version="6.0.0-rc.2" Target framework: (e.g. .NET 6.0-rc.2) Operating system: Windows IDE: (e.g. Visual Studio 2022)

roji commented 2 years ago

Is 12 seconds for EnsureDeleted, then EnsureCreated a normal time, or is something I am doing wrong?

I gave it a try, and on my machine it's more like 7 seconds in total...

I also tried the drop all tables approach. The PostgreSQL commands run quickly from pgAdmin but it actually takes longer than the Respawn approach (something like 4 seconds) because EnsureCreated has to add the tables.

This is odd - actually creating tables should be very fast, the thing that takes up most of the time in EnsureCreated should be the database creation itself, which shouldn't be occurring in this case...

Stepping back, is the idea to create a different table schema on the same database (I'm not sure why that would be necessary), or simply to empty the tables from their data? If it's the latter, then rather than dropping tables (like with the DROP SCHEMA public CASCADE approach), you could use the PG TRUNCATE command (which is an efficient form of DELETE FROM table); note that it supports CASCADE as well.

If you really need to drop and recreate all tables, then the DROP SCHEMA public CASCADE approach seems reasonable, but I can't imagine the whole thing taking so long - can you share a project/schema?

ajcvickers commented 2 years ago

I guess ProstgreSQL is just a slow database system. :trollface: 😉

roji commented 2 years ago

:fire: :firecracker: :fist_oncoming:

roji commented 2 years ago

Yeah, databases in general are known to be more heavyweight in PG compared to e.g. SQL Server. Among other things, a connection is always made to a specific database, and switching the database on an existing connection simply isn't possible. PG definitely isn't made for creating/dropping lots databases quickly (but I'm not sure that's a good pattern even in other databases where it's faster...).

ajcvickers commented 2 years ago

You are so easy to troll. 😹

roji commented 2 years ago

I know, I know... I'm extremely trollable when it comes to PG :)

JonPSmith commented 2 years ago

Hi @roji,

Thanks for looking at this, and sorry for opening you up to @ajcvickers remarks ☹️ !!

I gave it a try, and on my machine it's more like 7 seconds in total...

OK, that sounds that my 12 second, which might be because you are on a pure Linux system. I remember before EF Core 5 (well really the Microsoft.Data.SqlClient change) EnsureDeleted + EnsureCreated took 10 seconds, and now its 1.5 seconds.

This is odd - actually creating tables should be very fast...

I enabled LogFor with timings and it looks like opening a connection to the server takes 1.5 seconds. To try this run the test TestPostgreSqlHelpers.TestEnsureCreatedAndEmptyPostgreSqlOk and see the output (details of the repo at the end).

Stepping back, is the idea to create a different table schema on the same database?

Ideally I would like a PostgreSQL version of your SQL Server EnsureClean method inside the test code of EF Core. @ajcvickers pointed this out and I have copied it into the EfCore.TestSupport library - its really fast (500 ms!) and it makes sure the database is up to date to the EF Core Model too.

However, Having a fast "empty all rows" approach is acceptable because the EfCore.TestSupport library has a DeleteAllPostgreSqlUnitTestDatabases that deletes all the PostgreSql databases that start by the test common database name. So, when the EF Core Model changes the developer has to run this and the next run of the test will create a new database.

can you share a project/schema?

The open-source library is in the repo https://github.com/JonPSmith/EfCore.TestSupport and the code to look at:

If you want to run the tests, then you need to edit PostgreSqlConnection connection string in the the appsetting.json in the Test project. The actual database name starts with the database name in that connection string but has the test class name added on the end so that each parallel unit test class has its own, unique database.

roji commented 2 years ago

I enabled LogFor with timings and it looks like opening a connection to the server takes 1.5 seconds. To try this run the test TestPostgreSqlHelpers.TestEnsureCreatedAndEmptyPostgreSqlOk and see the output (details of the repo at the end).

I ran TestEnsureCreatedAndEmptyPostgreSqlOk - the first time it took about 6 seconds, the second time about 1.6. This seems to be because the test does EnsureCreated, but not EnsureDeleted, so almost nothing has to happen the second time around; this seems problematic, by the way - if the schema changes, this test will start failing since EnsureCreated doesn't make sure the schema is in sync with the model. Adding EnsureDeleted before EnsureCreated makes the test run consistently in about 7.5 seconds.

To summarize, any approach which drops/creates an entire database per test or test class is generally going to be slow; maybe especially so on PostgreSQL, but probably on other databases as well (compared to alternative approaches, in any case). I don't think there's any way around that really.

Ideally I would like a PostgreSQL version of your SQL Server EnsureClean method inside the test code of EF Core. @ajcvickers pointed this out and I have copied it into the EfCore.TestSupport library - its really fast (500 ms!) and it makes sure the database is up to date to the EF Core Model too.

That exists - see NpgsqlDatabaseCleaner, you can use that if you want. Generally all EF Core providers which implement the EF specification test suite should have an implementation for this; I've opened #26550 to consider bringing this functionality into the product and exposing it to users writing integration tests.

One note on the implementation of database create/drop logic... Rather than copying database-specific logic (as you've done e.g. in DeleteAllPostgreSqlUnitTestDatabases), you may want to consider setting up a simple DbContext with the appropriate connection string, and then simply using context.Database.EnsureDeleted/Created; this would leverage the provider's IDatabaseCreator which already knows how to do all this, rather than copying the logic out for each database.

DeleteAllPostgreSqlUnitTestDatabases

Just to nitpick :rofl: Should this be called "unit test" databases? Typically "unit test" implies no real database is involved (as opposed to integration test).

JonPSmith commented 2 years ago

Hi @roji,

I ran TestEnsureCreatedAndEmptyPostgreSqlOk - the first time it took about 6 seconds, the second time about 1.6.

Yes, I assume that the EF Core Model doesn't change a lot so I was testing the "already have database" version. Other tests time the EnsureDeleted + EnsureCreated situation. I explain that if the EF Core Model you need to call the "DeleteaAll" method, but in SQL Server having to delete all the test databases isn't used so much because of SQL Server EnsureClean.

That exists - see NpgsqlDatabaseCleaner, you can use that if you want.

That would be perfect!! I will certainly look at that. For the SQL Server EnsureClean I had to edit the CreateDatabaseModelFactory ctor, which I will need to do with NpgsqlDatabaseCleaner. I have added a comment on issue #26550.

One note on the implementation of database create/drop logic

Yes, using context.Database.EnsureDeleted/Created would be a better way. I'll have a go at that.

Just to nitpick 🤣 Should this be called "unit test" databases...

Good point. I will change that.

Thanks for looking at this and providing really good tips/code. Thanks.

roji commented 2 years ago

My pleasure, and thanks for working on this and raising these thoughts - as I said I'm thinking about testing at the moment for the docs, so this is very relevant.

JonPSmith commented 2 years ago

Hi @roji

OK, I have copied the NpgsqlDatabaseCleaner code from the https://github.com/npgsql/efcore.pg repo (and added a heading with the efcore.pg license) and I have got it to run.

Issues I came across

Performance figures

Here are the timings for the three approaches. NOTES:

Approach 7 tables 100 tables
EnsureDeleted + EnsureCreated 12.8 sec 13.4 sec
EnsureClean (PostgreSQL) 4.3 sec 5.1 sec
Respawn 2.2 sec 2.8 sec

I will provide both approaches:

Certainly EnsureClean is a better approach to new developers as it automatically handles EF Core Model changes, so I will list the EnsureClean approach first with a recommendation to use that. Personally I think I would go with the Respawn approach because its nearly 50% quicker than EnsureClean (PostgreSQL), but its async, which is a pain as exception stacktrace is harder to read - only using these approaches for real will tell me what my preference is.

PS. If you want to look at the code its in the dev branch of the https://github.com/JonPSmith/EfCore.TestSupport . I don't put it into master branch until its ready to be turned into a NuGet package.

roji commented 2 years ago

@JonPSmith thanks for measuring (and all the general discussion)! I do have one additional thought... Would it be possible for you to also benchmark the DROP SCHEMA public CASCADE approach and added it to the above table? My thinking is the following... NpgsqlDatabaseCleaner (as well as the SQL Server one) are currently implemented by reverse engineering the database, and then dropping each detected table/sequence one-by-one. This probably isn't the most efficient way to clear a database out - a small piece of procedural SQL which detects which schemas are defined and then drops them all would very likely be better. If that's so, then I'd also consider replacing the implementation of NpgsqlDatabaseCleaner to simply do that, instead of relying on reverse-engineering.

I suspect that this may get us to something very similar to Respawn's running time, if not even better.

roji commented 2 years ago

And one more note...

At least when presenting the various options to users, I'd make a very clear distinction between approaches which recreate the schema (EnsureDeleted+EnsureCreated, EnsureClean), and those which only clear out data (Respawn, TRUNCATE). I'd say the latter aren't suitable as a setup method at the very beginning of the test run - since schemas do change - but can make a lot of sense between tests within the same run. So at least IMHO it's not really a question of which is preferable/should be listed first - these are different things which solve different problems.

Personally I think I would go with the Respawn approach because its nearly 50% quicker than EnsureClean (PostgreSQL), but its async, which is a pain as exception stacktrace is harder to read

So aside from the fact using Respawn for first step would make tests fail if the schema changes and require manual intervention, I'm curious that async makes you want to stay away - with practically everything else being async nowadays :)

roji commented 2 years ago

The NpgsqlDatabaseCleaner contains a code to detect ServerCompatibilityMode.Redshift, but it uses a NET6 only feature (conn.Settings). My library is multi-framework (netstandard2.1 = EF Core 5, net6.0 = EF Core 6). I have used an #if NET6_0_OR_GREATER which assumes that the database doesn't have the ServerCompatibilityMode.Redshift setting in EF Core 5. Is that sensible?

I don't think NpgsqlConnection.Settings should be NET6.0-only... This is an Npgsql (ADO.NET) feature, and Npgsql itself still supports netstandard2.0 - so you should be fine (note the distinction between the Npgsql EF Core provider which is .NET 6.0 only - like EF Core - and the underlying ADO.NET driver). Regardless, this specific thing is to make sure things work when executing against Amazon Redshift, which is a PostgreSQL-like database. If you need to remove it for some reason, then it should really be OK to do so.

The NpgsqlDatabaseCleaner uses an internal NpgsqlSqlGenerationHelper. Shouldn't be a problem, but the SQL Server Cleaner didn't use any internal types.

Yeah... It doesn't actually need to reference NpgsqlSqlGenerationHelper - it just needs the provider implementation for ISqlGenerationHelper (in order to delimit some stuff), so if you can simply resolve that from the context and pass it to the cleaner, that should work fine. The current Cleaner doesn't use proper DI practices because it's just test infrastructure.

JonPSmith commented 2 years ago

Hi @roji,

Would it be possible for you to also benchmark the DROP SCHEMA public CASCADE approach.

I did try DROP SCHEMA public CASCADE and its slower Respawn and about the same as your cleaner - about 4 seconds for small table. I prefer your cleaner as DROP SCHEMA public CASCADE is about the same time and your cleaner drops other things like functions.

I'd make a very clear distinction between approaches which recreate the schema (EnsureDeleted+EnsureCreated, EnsureClean), and those which only clear out data (Respawn, TRUNCATE).

That has always an issue, especially before EF Core 5, where "delete all databases when EF Core changes" was the standard approach. But I am planning a big re-write of the docs as its not so clear since the EF Core 5 version.

I'm curious that async makes you want to stay away - with practically everything else being async nowadays :)

Good point. I try to use sync if I can because it can be a pain to working out what has gone wrong in exceptions from an async method.

I don't think NpgsqlConnection.Settings should be NET6.0-only

I definitely have an error in the netstandard2.1 version, with the Settings part the error is shown below

Error   CS1061  
'NpgsqlConnection' does not contain a definition for 'Settings' and no accessible extension method 'Settings' accepting a first argument of type 'NpgsqlConnection' could be found (are you missing a using directive or an assembly reference?)    

TestSupport (net6.0) - Available
TestSupport (netstandard2.1) - Not Available

I have spent an hour looking at different version of Npgsql etc. trying to fix this error, but can't find the problem. What I do know is you only added the if (conn.Settings.ServerCompatibilityMode == ServerCompatibilityMode.Redshift)... in the 6.0.0 version so I'm going to comment out with #if NET6_0_OR_GREATER.

Yeah... It doesn't actually need to reference NpgsqlSqlGenerationHelper - it just needs the provider implementation for ISqlGenerationHelper.

I'll stick with the internal code for now. If its still a problem in the later version I can fix it then.

roji commented 2 years ago

I did try DROP SCHEMA public CASCADE and its slower Respawn and about the same as your cleaner - about 4 seconds for small table. I prefer your cleaner as DROP SCHEMA public CASCADE is about the same time and your cleaner drops other things like functions.

Wow, that's really surprising... I'll do a bit of playing around too to see what I can come up with.

Note that dropping the schema also drops functions, since functions are always created in a schema - the same is true for most other object types (even custom types). In fact, this makes schema dropping must easier and more reliable: with the cleaner, I have to implement specific reverse-engineering and functions, types, etc., where dropping the schema just drops everything.

I definitely have an error in the netstandard2.1 version, with the Settings part the error is shown below

I think I may know what's going on... NpgsqlConnection.Settings was only publicly exposed in Npgsql 6.0 (https://github.com/npgsql/npgsql/pull/3570), are you by any chance using older version of Npgsql when testing netstandard2.1? If you use Npgsql 6.0 everywhere (it supports netstandard2.0), you should be fine.

JonPSmith commented 2 years ago

Wow, that's really surprising... I'll do a bit of playing around too to see what I can come up with.

I'm NOT an expert on PostgreSQL so I leave that to you! Let me know if you come up with something.

I think I may know what's going on... NpgsqlConnection.Settings.

Ah, the netstandard2.1 version is for EF Core 5 and uses Npgsql.EntityFrameworkCore.PostgreSQL Version="5.0.10", which uses Npgsql Version="5.0.10.

roji commented 2 years ago

Yeah, then that's why... All good, it's perfectly fine to drop that Redshift part in any case. I'll do some benchmarking and post back.

roji commented 2 years ago

Here's a bit of benchmarking - DROP SCHEMA public CASCADE; CREATE SCHEMA public clearly wins on my side. This isn't a big surprise - this technique does a single roundtrip to the database, making PostgreSQL do all the work on its side. NpgsqlDatabaseCleaner, in contrast, does multiple roundtrips to find tables, sequences, functions and other database entities, and then drops each one of these in a separate DDL statement (albeit in a single roundtrip). The second is pretty much guaranteed to be slower.

Note that there's a slightly inaccuracy here, since we need to drop all schemas, and not just the public one. I did that via some procedure SQL in another benchmark method - and it works just as quickly as just dropping public.

So in summary... I think the .NET ecosystem needs a test infrastructure component which knows how to efficiently empty databases (and possibly create a blank one if it doesn't already exist), a bit like how respawn can reset data. This wouldn't be part of EF Core (since non-EF projects could also use it), nor would it use EF Core to do its job like RelationalDatabaseCleaner (since there are more efficient ways). In fact, once something like that exists, I'd propose removing RelationalDatabaseCleaner from the EF tests altogether and use this instead.

BenchmarkDotNet=v0.13.1, OS=ubuntu 21.10
Intel Xeon W-2133 CPU 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=6.0.100-rc.2.21505.57
[Host] : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT
Job-VXYAUW : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT

InvocationCount=1 UnrollFactor=1

Method NumTables Mean Error StdDev
DropPublicSchema 10 10.42 ms 1.104 ms 3.256 ms
DropAllSchemas 10 11.67 ms 1.185 ms 3.474 ms
DatabaseCleaner 10 222.57 ms 4.235 ms 4.532 ms
DropPublicSchema 100 63.37 ms 2.646 ms 7.507 ms
DropAllSchemas 100 66.10 ms 2.516 ms 7.179 ms
DatabaseCleaner 100 360.20 ms 10.115 ms 29.507 ms
Benchmark code ```c# public class Program { private const string ConnectionString = "Host=localhost;Username=test;Password=test"; private NpgsqlConnection _conn; private NpgsqlCommand _dropAllSchemasCommand; private NpgsqlBatch _createTablesBatch, _dropPublicSchemaBatch; private NpgsqlDatabaseCleaner _cleaner; private FakeContext _context; [Params(10, 100)] public int NumTables { get; set; } [GlobalSetup] public void Setup() { NpgsqlConnection.ClearAllPools(); _conn = new NpgsqlConnection(ConnectionString); _conn.Open(); _createTablesBatch = new NpgsqlBatch(_conn); for (var i = 0; i < NumTables; i++) { _createTablesBatch.BatchCommands.Add(new NpgsqlBatchCommand($@" CREATE TABLE foo{i} ( id INT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY, name TEXT)")); _createTablesBatch.BatchCommands.Add(new NpgsqlBatchCommand($@" INSERT INTO foo{i} (name) VALUES ('bar1'), ('bar2'), ('bar3');")); } _dropAllSchemasCommand = new NpgsqlCommand { Connection = _conn, CommandText = @" DO $$ DECLARE r RECORD; BEGIN FOR r IN (SELECT nspname FROM pg_namespace WHERE nspname NOT IN ('pg_toast', 'pg_catalog', 'information_schema')) LOOP EXECUTE 'DROP SCHEMA ' || quote_ident(r.nspname) || ' CASCADE'; END LOOP; EXECUTE 'CREATE SCHEMA public'; END $$" }; _dropPublicSchemaBatch = new NpgsqlBatch(_conn) { BatchCommands = { new("DROP SCHEMA public CASCADE"), new("CREATE SCHEMA public") } }; _context = new FakeContext(_conn); _cleaner = new NpgsqlDatabaseCleaner(); CreateTables(); } [IterationSetup] public void CreateTables() { _dropAllSchemasCommand.ExecuteNonQuery(); _createTablesBatch.ExecuteNonQuery(); } [Benchmark] public void DropPublicSchema() => _dropPublicSchemaBatch.ExecuteNonQuery(); [Benchmark] public void DropAllSchemas() => _dropAllSchemasCommand.ExecuteNonQuery(); [Benchmark] public void DatabaseCleaner() => _cleaner.Clean(_context.Database); static void Main(string[] args) => BenchmarkRunner.Run(); class FakeContext : DbContext { private readonly NpgsqlConnection _connection; public FakeContext(NpgsqlConnection connection) => _connection = connection; protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder) => optionsBuilder.UseNpgsql(_connection); } } ```
JonPSmith commented 2 years ago

Looks great! Overall are you saying that you want a database provider to create the best/fast EnsureClean method. I think that would be great (see my setUpSchema suggestion for an additional feature).

What do you think: should I use your new code or stick with your existing NpgsqlDatabaseCleaner code?

PS. One thing the the stack overflow answer said was:

If you are using PostgreSQL 9.3 or greater, you may also need to restore the default grants.

GRANT ALL ON SCHEMA public TO postgres;
GRANT ALL ON SCHEMA public TO public;
roji commented 2 years ago

Overall are you saying that you want a database provider to create the best/fast EnsureClean method.

Yeah, although this would likely be some community package which knows different databases and how to clean them up. I don't think this belongs in an EF provider, or in an ADO.NET provider, but there's no reason this can't be some external package, similar to how respawn already handles data resetting across databases.

What do you think: should I use your new code or stick with your existing NpgsqlDatabaseCleaner code?

I think my procedural SQL fragment should be preferred (simply since it's faster) - though you should be able to see that in your own perf measurements... If not, it's probably worth understanding what's going on...

If you are using PostgreSQL 9.3 or greater, you may also need to restore the default grants.

Good point, thanks!

JonPSmith commented 2 years ago

I membered that the DROP SCHEMA commands were executed quickly, and it was the call to EnsureCreated that was slow. In your code you have a CreateTables() but I don't see the code. I see an creator.CreateTables(); in the RelationalDatabaseCleaner - is that what you are using??

roji commented 2 years ago

The CreateTables method is there in the benchmark source code - it's just some manually-prepared batch to create some tables and populate them.

Doesn't EF Core's EnsureCreated have to be called either way in order to create the schema? In other words, the RelationalDatabaseCleaner eventually has to do the exact same thing, i.e. call CreateTables... EF Core's EnsureCreated checks if the database exists (would be true in our case), then checks whether any tables exist (would be false), and then calls into the same CreateTables that RelationalDatabaseCleaner calls into.

JonPSmith commented 2 years ago

My testing said that EnsureCreated took up most the time in the drop scheme approach. You might need to check your code with EnsureCreated (or the creator.CreateTables(); from the RelationalDatabaseCleaner

roji commented 2 years ago

@JonPSmith I'll take another look.

RelationalDatabaseCleaner internally does (almost) the same thing as EnsureCreated to recreate the schema. So we should effectively be comparing the two ways to drop all schemas. But maybe I'm missing something.

JonPSmith commented 2 years ago

Hi @roji,

Thanks for you help. As you will seen I have released a version EfCore.TestSupport which supports EF Core 5 and EF Core 6 now that NET6 is out (with PostgreSQL methods including EnsureCreated). I expect you are busy with that update too.

I have quite a few libraries so I need to focus on NET 6 release first, but I do still want to see if the DROP SCHEMA approach will improve this. I expect you are really busy so I'm happy to try your DROP SCHEMA code and add a call the EnsureCreated. I suspect it will be slow because that's what I found with with my simple DROP SCHEMA but its worth a check.

PS. I included the Respawn approach in the new release, but the documents don't include the Respawn approach. If the DROP SCHEMA is faster I'll remove the Respawn approach, otherwise I will describe it with the out-of-date schema problem.

JonPSmith commented 2 years ago

Hi @roji,

I added a simple benchmark test and added context.Database.EnsureCreated(); after the DROP SCHEMA and I the DROP SCHEMA approach wins, but with as big as your benchmark. Here is the results for 7 tables

Method Mean Error StdDev
DropPublicSchemaWithEnsureCreated 2.125 s 0.0188 s 0.0175 s
DropAllSchemasWithEnsureCreated 2.132 s 0.0177 s 0.0165 s
EnsureClean 4.232 s 0.0238 s 0.0223 s
EnsureCreatedAndWipedByRespawn 2.127 s 0.0121 s 0.0113 s

The DROP SCHEMA approach takes the same time as the Respawn approach, but of course the DROP SCHEMA also makes sure the schema is correct for the current EF Core Model. I therefore plan to change my PostgreSQL EnsureClean method to to use the DropAllSchemasWithEnsureCreated code.

Can I ask you to decided if your current DROP All SCHEMA code should include the extra SQL commands suggested in the stack overflow answer. If you think this is needed could you send me the updated SQL code of the DROP All SCHEMA with commands below added.

GRANT ALL ON SCHEMA public TO postgres;
GRANT ALL ON SCHEMA public TO public;

Thanks in advance.

roji commented 2 years ago

@JonPSmith thanks for continuing to look into this - I'm still stuck on other things but was intending to look again myself.

Re the extra commands to grant permissions, I'd say these aren't really necessary in our specific context of testing. The schema creation (which happens after the drop) grants access to the creating user, which is very likely also the user on which the tests are running. So unless some very convoluted things are happening in the test suite, including some specific games around different users and permissions, these wouldn't be necessary.

But if you're looking for a fully complete fragment that brings your database back to a clean slate that is as close as possible to a newly-created database, then this can be included:

DO $$
DECLARE
    r RECORD;
BEGIN
    FOR r IN (SELECT nspname FROM pg_namespace WHERE nspname NOT IN ('pg_toast', 'pg_catalog', 'information_schema'))
        LOOP
            EXECUTE 'DROP SCHEMA ' || quote_ident(r.nspname) || ' CASCADE';
        END LOOP;
    EXECUTE 'CREATE SCHEMA public; GRANT ALL ON SCHEMA public TO postgres; GRANT ALL ON SCHEMA public TO public';
END $$

I'd be surprised if these two statements have any noticeable impact on perf - you can check that out if you want. Assuming that's the case, you may as well just include them...

JonPSmith commented 2 years ago

Wow, talk about going down a rabbit hole. Performance tuning is hard work!

OK, in summary we both forget about the test to see if the database exists (but updating my tests caught that). I turns out that testing if the database exists costs a LOT of time - the tests below suggests its about 2 second on my system. My conclusions are:

Method Mean Error StdDev Median
EnsureCleanUsingDropSchema 4,219.43 ms 14.794 ms 13.838 ms 4,220.12 ms
WipedByRespawnNoCheckDbExists 58.34 ms 1.594 ms 4.623 ms 59.59 ms
WipedByRespawnWithCheckForDbExists 2,128.80 ms 17.257 ms 16.143 ms 2,128.82 ms
EnsureDeletedEnsureCreated 12,685.31 ms 51.618 ms 48.283 ms 12,692.66 ms

If you have any great ideas for improving the database exists code then let me know. NOTE: I did try adding an IF EXISTS ... to the DROP SCHEMA SQL, but the connection fails because there is no database.

My next steps:

JonPSmith commented 2 years ago

PS. The PostgreSQL connections performance feels like the old Microsoft.Data.SqlClient library which used to be slow, but was updated when EF Core 5 came out.

Could the Npgsql connection management be improved??

roji commented 2 years ago

Good point about the database existence check - but are you saying that simply adding it adds like 2 seconds (on top of the schema drop)? I did a quick benchmark on my machine, and I'm seeing NpgsqlDatabaseCreator.ExistsAsync complete in only 12ms - when the database already exists. I'm assuming we're interested in that scenario, since after the very first time tests are executed, the database will always exist and gets reused.

So it would be good to understand exactly where why it's taking so long on your side... Are you connecting to a remote (possibly far-away) PostgreSQL? Are you using some other method and not NpgsqlDatabaseCreator.ExistsAsync?

I've opened https://github.com/npgsql/efcore.pg/issues/2103 to track looking at the perf EnsureDeleted/EnsureCreated - I agree it seems long and warrants investigation. Though I still believe tests could be using a different, more efficient mechanism for getting to clean state (like the above) - it really doesn't make sense to me to be constantly dropping and creating databases when running tests.

Full benchmark results:

BenchmarkDotNet=v0.13.1, OS=ubuntu 21.10 Intel Xeon W-2133 CPU 3.60GHz, 1 CPU, 12 logical and 6 physical cores .NET SDK=6.0.100 [Host] : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT DefaultJob : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT

Method Mean Error StdDev
Foo 12.01 ms 0.240 ms 0.351 ms
Benchmark code ```c# BenchmarkRunner.Run(); public class Benchmarks { private BlogContext _context; [GlobalSetup] public async Task Setup() { _context = new BlogContext(); await _context.Database.EnsureDeletedAsync(); await _context.Database.EnsureCreatedAsync(); } [Benchmark] public Task Foo() => _context.GetService().ExistsAsync(); public class BlogContext : DbContext { public DbSet Blogs { get; set; } static ILoggerFactory ContextLoggerFactory => LoggerFactory.Create(b => b.AddConsole().AddFilter("", LogLevel.Information)); protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder) => optionsBuilder .UseNpgsql(@"Host=localhost;Username=test;Password=test") .EnableSensitiveDataLogging() .UseLoggerFactory(ContextLoggerFactory); } public class Blog { public int Id { get; set; } public string Name { get; set; } } } ```
JonPSmith commented 2 years ago

I called context.Database.EnsureCreated() on an existing database and it took 2 seconds. My PostgreSQL server 12.8 is running in a WSL2 Windows Subsystem for Linux. If you know a better way to run PostgreSQL server then let me know!

The interesting thing is accessing an 'normal' database seems to be quick (as shown by the Respawn times), but accessing the postgre database takes the time.

Update

I didn't check that the database exists, but caught the exception. This reduces the time from 6 second to 4 seconds. The code below checks if there is a database and it takes 2 seconds!! If I run the same SQL in pgAdmin it takes 50 ms. It's got to be the time to open the connection to the postgre database, but why is that??

private static bool DatabaseExists(this string connectionString)
{
    var builder = new NpgsqlConnectionStringBuilder(connectionString);
    var orgDbStartsWith = builder.Database;
    builder.Database = "postgres";
    var newConnectionString = builder.ToString();
    using var conn = new NpgsqlConnection(newConnectionString);
    conn.Open();

    using var cmd = new NpgsqlCommand($"SELECT COUNT(*) FROM pg_catalog.pg_database WHERE datname='{orgDbStartsWith}'", conn);
    return (long)cmd.ExecuteScalar() == 1;
}
roji commented 2 years ago

Maybe there's some sort of auth config issue with your PostgreSQL, I don't know - but just connecting to the postgres database definitely shouldn't take 2 seconds. Maybe it's a good idea to test this on plain Windows, just to eliminate the possibility that WSL2 is somehow responsible (would be weird, but...), or if something is somehow wrong with your PG installation inside WSL2... Installing PG on Windows is super easy.

JonPSmith commented 2 years ago

Found it!!!!

I did some googling about slow access and one answer was host should be 127.0.0.1, not localhost and that fixed it. It is now super-quick!!

I ran my benchmark, but the EnsureCleanUsingDropSchema part had the error Npgsql.PostgresException (0x80004005): 53300: sorry, too many clients already, but looking at what it did do it was 90 ms.

Method Mean Error StdDev
EnsureCleanUsingDropSchema ~90 ms NA NA
WipedByRespawnNoCheckDbExists 41.97 ms 1.045 ms 3.081 ms
WipedByRespawnWithCheckForDbExists 67.34 ms 2.912 ms 8.586 ms
EnsureDeletedEnsureCreated 340.45 ms 16.030 ms 47.266 ms

That is fantastic! I have to update my article now.

Thanks for putting up with my problem, but the end result is mega.

@ajcvickers, see PostgreSQL isn't slow, just I just got it wrong.

roji commented 2 years ago

see PostgreSQL isn't slow, just I just got it wrong.

image

I ran my benchmark, but the EnsureCleanUsingDropSchema part had the error Npgsql.PostgresException (0x80004005): 53300: sorry, too many clients already, but looking at what it did do it was 90 ms.

That happens if you open too many connection concurrently (or forget to close somewhere), bringing you above the Max Pool Size limit, which is 100 by default.

After all this discussion, can you please summarize what EnsureCleanUsingDropSchema does (there were some back and forths :sweat_smile:)? Are you calling the provider's Exists method, and then DROP SCHEMA pubilc if it returns true (and EnsureCreated if false)?

JonPSmith commented 2 years ago

Hi @roji,

I have finished this and here is a summary:

What I did do

What I didn't do

PS. Can you tell me how to tell Benchmark not to run more than 100 clients so that I don't get the 53300: sorry, too many clients already error. I looked at the Benchmark docs but I couldn't work it out.

roji commented 2 years ago

PS. Can you tell me how to tell Benchmark not to run more than 100 clients so that I don't get the 53300: sorry, too many clients already error. I looked at the Benchmark docs but I couldn't work it out.

Is your benchmark using BenchmarkDotNet? If so, BDN is single-threaded, and should not ever run the the benchmark method more than once concurrently; you may have a connection leak or something - if you point me at the benchmark source I can try to help.

roji commented 2 years ago

Other than that, thanks for the details - everything looks great, and it's good to have this as a nuget package people can integrate in their tests! As discussed offline, I think it would be good to have something like this without any sort of EF Core dependency: a simple, cross-database package that can be used to bring a database to a clean slate state, regardless of whether the database exists or not, or whether it already has tables. EF Core users could invoke this, and then just do EF Core's EnsureCreate, which would create the schema. I'll look into this and will keep you posted.

ErikEJ commented 2 years ago

@JonPSmith : Maybe update this note?

/// NOTE: This only works for SQL Server

JonPSmith commented 2 years ago

@roji , found the problem - one of the NpgsqlConnection didn't have a using var. Hard to catch that type of bug.

Benchmark results are:

Method Mean Error StdDev
EnsureCleanUsingDropSchema 68.73 ms 3.467 ms 10.168 ms
EnsureDeletedEnsureCreated 316.62 ms 10.348 ms 30.022 ms

As I said, the EnsureClean is so fast that its not worth using the Respawn approach.

@ErikEJ, thanks for point that out. I have updated that.

New NuGet release 5.2.1 now out.

roji commented 2 years ago

Thanks @JonPSmith!

JonPSmith commented 11 months ago

Hi @roji,

I'm looking at updating my libraries to .NET 8 and I am starting with my EfCore.TestSupport as I use this to check my other libraries.

I have come to an error in the EnsureClean feature. The Npgsql.EntityFrameworkCore.PostgreSQL" Version="8.0.0-rc.1" doesn't have the CheckPoint feature that is in the older versions. Here is the code that doesn't compile

public async static Task EnsureCreatedAndEmptyPostgreSqlAsync<T>(this T context, bool thereIsAnExistingDatabase = false
    where T : DbContext
{
    Checkpoint EmptyCheckpoint = new Checkpoint
    {
        DbAdapter = DbAdapter.Postgres
    };

    if (thereIsAnExistingDatabase || !context.Database.EnsureCreated())
    {
        //the database already exists, so we just need to empty the tables

        var connectionString = context.Database.GetConnectionString();
        using var conn = new NpgsqlConnection(connectionString);              
        await conn.OpenAsync();
        await EmptyCheckpoint.Reset(conn);          
    };
}

Looks like this has changed. Can you point me to the code to fix this.

roji commented 11 months ago

@JonPSmith this Checkpoint class isn't familiar to me, are you sure it used to be part of Npgsql.EntityFrameworkCore.PostgreSQL? Can you point me to it in the older version or similar?

JonPSmith commented 11 months ago

Sorry @roji my mistake. The Postgres code is fine but I added another approach using Respawn, which has the error. It wasn't quicker that the Postgres code, but it wasn't but I forgot to remove that code.

Sorry to bother you.