dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Proposal: Move DbConnectionPool timers into one timer #38339

Open NickCraver opened 4 years ago

NickCraver commented 4 years ago

Description

We have observed a large number of timers in our multi-tenant environment and have traced the vast majority (thousands vs. a few) to DbConnectionPool, which has a cleanup timer.

While the general approach for pool cleanup (purging idle connections from the pool) per-pool is fine for things connecting to a single data source or only a few (likely the majority of cases), it doesn't scale well for hundreds, thousands, or hundreds of thousands of connection strings/pools. It creates a timer per active pool.
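For illustration, the current pattern has roughly this shape (a simplified sketch, not the exact runtime source):

```csharp
using System;
using System.Threading;

// Simplified sketch of the current per-pool pattern (not the exact source):
// every pool instance arms its own periodic cleanup Timer, so N distinct
// connection strings/pools means N live timers.
internal sealed class DbConnectionPoolSketch
{
    private static readonly TimeSpan s_cleanupWait = TimeSpan.FromMinutes(4); // illustrative interval

    private Timer _cleanupTimer;

    public DbConnectionPoolSketch()
    {
        _cleanupTimer = CreateCleanupTimer();
    }

    private Timer CreateCleanupTimer() =>
        new Timer(CleanupCallback, null, s_cleanupWait, s_cleanupWait);

    private void CleanupCallback(object state)
    {
        // Purges idle connections from this pool only.
    }
}
```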

On Stack Overflow, as an example, we see ~400-600 timers at any given time, and ~580 of them are this cleanup timer on connection pools. In the Teams case it's much higher: thousands active at any given time.

Additionally, this makes debugging more difficult by nature. When looking at memory dumps and analyzing where your objects are rooted, the rather nasty effect of Timers living in a doubly-linked list is more than a little confusing and obfuscating. I'm not sure what practical impact such a large list has on running timers, but it can't be a positive thing.

Regression?

To be clear, this isn't a new/recent regression. It's been this way for a very long time. We're just able to find some of these issues more easily since the migration to .NET Core.

Data

Here are some examples from our infrastructure: *(two screenshots omitted)*

Proposal

We're trying something at Stack via reflection (a lot of reflection) to null out these per-pool timers and instead, in a single timer, loop over all pools (ultimately rooted from System.Data.SqlClient.SqlConnectionFactory.SingletonInstance - the same is true with Microsoft.Data.SqlClient).

The approach isn't that complicated, but I wanted to propose upstreaming this into .NET itself. It's not completely trivial, since the instance we're rooting into to find all the pools lives in a client that now ships side-by-side, but maybe someone has ideas there. Or, if there are concerns with a single timer: what are they? Our current running assumption is that it was designed the current way either to a) allow flexibility in ways we don't need or understand, or because b) the scale of pools wasn't a design concern ages ago in this area. ...or both.
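To make the idea concrete, here's a minimal sketch of the direction we have in mind (an assumed design; the registration seam and names are invented for illustration, not an existing API):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Assumed design, not an actual API: one shared timer sweeps every
// registered pool instead of each pool owning its own timer.
internal static class SharedPoolCleanup
{
    private static readonly List<WeakReference<IPrunablePool>> s_pools =
        new List<WeakReference<IPrunablePool>>();

    private static readonly Timer s_timer =
        new Timer(SweepAll, null, TimeSpan.FromMinutes(4), TimeSpan.FromMinutes(4));

    // A pool registers itself once, at construction time.
    internal static void Register(IPrunablePool pool)
    {
        lock (s_pools) s_pools.Add(new WeakReference<IPrunablePool>(pool));
    }

    private static void SweepAll(object state)
    {
        lock (s_pools)
        {
            for (int i = s_pools.Count - 1; i >= 0; i--)
            {
                if (s_pools[i].TryGetTarget(out var pool))
                    pool.PruneIdleConnections(); // runs the existing per-pool cull logic
                else
                    s_pools.RemoveAt(i);         // pool was collected; drop the dead entry
            }
        }
    }
}

// Hypothetical seam the pools would implement.
internal interface IPrunablePool
{
    void PruneIdleConnections();
}
```

Holding the pools weakly means the shared timer never roots a pool that has been culled, which would also cut down on the rooting noise in memory dumps.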

Is this something that would be welcomed upstream, if we can identify a good crawl point?

cc @davidfowl @ajcvickers @mgravell @deanward81 @m0sa

ghost commented 4 years ago

Tagging subscribers to this area: @roji, @ajcvickers Notify danmosemsft if you want to be subscribed.

akoeplinger commented 4 years ago

> We have observed a large number of timers in our multi-tenant environment and have traced the vast majority (thousands vs. a few) to DbConnectionPool, which has a cleanup timer.

The link you mentioned links to the Odbc version but you probably want the SqlClient one which is in a different repo: https://github.com/dotnet/SqlClient/blob/5e943028d702cdc948263f96d60a5677860577f4/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/ProviderBase/DbConnectionPool.cs#L395

NickCraver commented 4 years ago

@akoeplinger That's the Microsoft.Data.SqlClient one ;) It's weird because some of this has forked: the current repos and the 3.1 we're currently on diverge a bit. But overall, it lives in DbConnectionPool. That code seems to have been forked/duplicated in odd ways with SqlClient moving out of band, though. I'm not 100% sure we understand the current state, TBH.

roji commented 4 years ago

FWIW Npgsql has the same timer-per-pool strategy when idle connection pruning is on, and I agree that it isn't ideal - a single global timer is likely the way to go. Off the top of my head I can't see any reason why a single timer couldn't do the same work just as well, I'm guessing nobody thought of scaling to so many pools.

@NickCraver out of curiosity, are you at ~500 pools because you're actually connecting to 500 different databases? Or is it other variations in the connection string to the same database? A bit of context may shed some light on other possible issues here.

roji commented 4 years ago

(opened https://github.com/npgsql/npgsql/issues/3043 to consider also for Npgsql)

NickCraver commented 4 years ago

Indeed we're connecting to hundreds of databases or many thousands of schemas at once. All different connection strings, from the same app identity.

While our crazy reflection method kind of works, in practice it doesn't do much for rolling loads. Because inactive pools themselves are culled, when they spin back up they get a new timer...until our next pass. If the timer moved down into the framework itself, though, that would solve the issue.

Here's our approach, using a single timer that rolls every 5 minutes across all pools and either a) nulls out the pool's timer if present, or b) runs the cull if the timer is already null (to prevent double runs, we just null on "first contact"):

*(two screenshots omitted)*
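In outline, the sweep does something like the following (a hedged reconstruction rather than the exact code in the screenshots; the internal member names are assumptions and differ between SqlClient versions):

```csharp
using System;
using System.Collections;
using System.Reflection;
using System.Threading;

// Hedged reconstruction of the reflection sweep described above; the internal
// member names ("_connectionPoolGroups", "_poolCollection", "_cleanupTimer",
// "CleanupCallback") are assumptions and can break between versions.
internal static class PoolTimerSweeper
{
    private const BindingFlags All = BindingFlags.Instance | BindingFlags.Static |
                                     BindingFlags.Public | BindingFlags.NonPublic;

    // Invoked from our own single 5-minute timer.
    public static void Sweep()
    {
        var factoryType = typeof(Microsoft.Data.SqlClient.SqlConnection).Assembly
            .GetType("Microsoft.Data.SqlClient.SqlConnectionFactory");
        object factory = FindField(factoryType, "SingletonInstance")?.GetValue(null);
        if (factory is null) return;

        var groups = (IDictionary)FindField(factory.GetType(), "_connectionPoolGroups")?.GetValue(factory);
        if (groups is null) return;

        foreach (object group in groups.Values)
        {
            var pools = (IDictionary)FindField(group.GetType(), "_poolCollection")?.GetValue(group);
            if (pools is null) continue;

            foreach (object pool in pools.Values)
            {
                FieldInfo timerField = FindField(pool.GetType(), "_cleanupTimer");
                if (timerField?.GetValue(pool) is Timer timer)
                {
                    // "First contact": kill the per-pool timer so it never fires again.
                    timer.Dispose();
                    timerField.SetValue(pool, null);
                }
                else
                {
                    // Timer already nulled on an earlier pass: run the cull ourselves.
                    pool.GetType().GetMethod("CleanupCallback", All)?.Invoke(pool, new object[] { null });
                }
            }
        }
    }

    // Private fields live on their declaring type, so walk the hierarchy.
    private static FieldInfo FindField(Type type, string name)
    {
        for (; type != null; type = type.BaseType)
        {
            FieldInfo field = type.GetField(name, All);
            if (field != null) return field;
        }
        return null;
    }
}
```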

So it has made a huge impact, but there's still a lot happening in the rolling-workload-with-idle-pools scenario. I'd think that's not uncommon for any high-count multi-tenancy. If we moved to a shared timer, we'd get the best of all worlds. We've contemplated making CreateCleanupTimer return null at runtime to work around this, but would much, much rather see a framework change in this department :)

roji commented 4 years ago

@NickCraver I guess the thing I'm not used to seeing is multi-tenancy that's implemented via changes in the connection string (may be some PG vs. SQL Server difference here).

In any case, assuming we're discussing SqlClient, DbConnectionPool has moved out of dotnet/runtime and into dotnet/SqlClient, so Microsoft.Data.SqlClient has its own copy and can do what it wants there. So it probably makes sense to move this issue to dotnet/SqlClient, unless someone actually cares about making this improvement in the ODBC/OleDB drivers...

stephentoub commented 4 years ago

> I'm not sure what practical impact such a large list has on running timers, but it can't be a positive thing.

On .NET Core, I'd expect the impact to be minimal. I've seen apps with tens of thousands of timers, and while that can have a very measurable effect on .NET Framework (without opting in to some fixes), on .NET Core improvements we made in 2.1 and 3.0 make those issues largely go away, at least from what I've seen. I'd love to know if in practice you see otherwise, i.e. if you actually see measurable degradation in metrics other than timer count.

Brar commented 4 years ago

> the thing I'm not used to seeing is multi-tenancy that's implemented via changes in the connection string (may be some PG vs. SQL Server difference here)

@roji I can totally imagine a multi-tenant application that uses one schema per tenant and implements this either by connecting as a different user (via `ALTER ROLE tenant1_user SET search_path TO tenant1_schema;`) or by using our new Options connection string parameter (`Options=search_path=tenant1_schema`), both of which would result in one pool per tenant.
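For example (a hypothetical sketch following the Options form above; each distinct connection string gets its own pool):

```csharp
using Npgsql;

// Hypothetical example: same server, same database, but per-tenant
// connection strings, so the driver keeps one pool (and, today, one
// cleanup timer) per tenant.
var tenant1 = new NpgsqlConnection(
    "Host=db;Username=app;Password=secret;Database=appdb;Options=search_path=tenant1_schema");
var tenant2 = new NpgsqlConnection(
    "Host=db;Username=app;Password=secret;Database=appdb;Options=search_path=tenant2_schema");
```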

NickCraver commented 4 years ago

> In any case, assuming we're discussing SqlClient, DbConnectionPool has moved out of dotnet/runtime and into dotnet/SqlClient, so Microsoft.Data.SqlClient has its own copy and can do what it wants there. So it probably makes sense to move this issue to dotnet/SqlClient, unless someone actually cares about making this improvement in the ODBC/OleDB drivers...

I can see this both ways: 1) they're almost entirely just copying all this, so do we fork it? Or 2) put it in the base and everyone gets it. Though admittedly, it's one global timer per copy of the pooling code either way.

> I'd love to know if in practice you see otherwise, i.e. if you actually see measurable degradation in metrics other than timer count.

The main problem has been debugging and GC rooting, for sure. At any given time we have thousands of root paths because of those linked lists. I'm not sure yet whether this impacts GC cleanup time. We've been live for a day with reduced timers, so we'll check on metrics, but the problem we're trying to immediately solve is the crazy roots. Still, we figured this was an overall good change for all frameworks and couldn't picture a downside, hence the pitch. I'll dig into our overall metrics over the next few days (some things ramp up over time), but honestly it'll take ~1 week to compare minor differences and see much impact when it's not severe. I can say there's no severe impact; I'm unsure about minor impact, though I don't expect any based on your comments :)

roji commented 4 years ago

@brar

> @roji I can totally imagine a multi-tenant application that uses one schema per tenant and implements this either by connecting as a different user (via `ALTER ROLE tenant1_user SET search_path TO tenant1_schema;`) or by using our new Options connection string parameter (`Options=search_path=tenant1_schema`), both of which would result in one pool per tenant.

Sure, these are certainly possible. But regardless of this timer issue, it means you're creating extreme fragmentation in your pooling, which has a lot of perf consequences. If your security model is so strict as to require connecting as different database users, then yeah, there's no alternative; but that's definitely not the case in most web apps, where there aren't separate db users. Re doing it just to set search_path: at least in PG, considering that you can schema-qualify your tables in your queries, or set search_path after opening the connection, massively fragmenting your pooling in this way seems like a general anti-pattern to me...
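For comparison, the single-pool alternative might look like this (a sketch with a hypothetical schema name; it assumes the driver resets session state when the connection is returned to the pool, as Npgsql does by default):

```csharp
using Npgsql;

// One shared connection string => one pool, however many tenants there are.
await using var conn = new NpgsqlConnection("Host=db;Username=app;Password=secret;Database=appdb");
await conn.OpenAsync();

// Point the pooled connection at the tenant's schema after opening it.
await using (var cmd = new NpgsqlCommand("SET search_path TO tenant1_schema", conn))
{
    await cmd.ExecuteNonQueryAsync();
}

// ...tenant-scoped queries here; unqualified table names resolve via search_path.
```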

@NickCraver

> I can see this both ways: 1) they're almost entirely just copying all this, so do we fork it? Or 2) put it in the base and everyone gets it. Though admittedly, it's one global timer per copy of the pooling code either way.

AFAICT DbConnectionPool has already been copied across to Microsoft.Data.SqlClient, so it's effectively already forked. So changes would have to happen in MS.Data.SqlClient, and then we could optionally think about doing the same in runtime - where it would only affect the old drivers, which I don't think makes a lot of sense at this point...

m0sa commented 10 months ago

Bump