dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Sync API surface of ADO.NET - what should be done? #26692

Closed roji closed 4 years ago

roji commented 6 years ago

While looking into rebuilding Npgsql on top of System.IO.Pipelines, I noticed that it has no sync APIs. This issue isn't to complain about that or to ask for sync APIs, but rather to ask for guidance. The problem is that, being as old as it is, the ADO.NET API includes both sync and async methods (e.g. ExecuteReader(), ExecuteReaderAsync()). Up to now Npgsql has fully supported both sync and async paths for all operations (not an easy task), but with new stuff like pipelines that doesn't really seem possible anymore.

My options seem to be:

  1. Throw for sync APIs. This would immediately break a large number of the applications out there.
  2. Implement the sync methods as sync-over-async. This would create deadlocks in user code; I'm not sure it's better than throwing (just less obvious).
  3. Add sync APIs to System.IO.Pipelines (this would have to include any endpoint adapter, such as Marc Gravell's Pipelines.Sockets.Unofficial).
  4. Sadly decide that System.IO.Pipelines can't be used in this scenario (would be a shame, it's pretty awesome for exactly this kind of thing).
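[Editor's note] Option 2 can be made concrete with a short sketch (the wrapper name is mine, purely illustrative): the sync entry point simply blocks on the async implementation.

```csharp
using System;
using System.IO.Pipelines;

// Hypothetical illustration of option 2: a sync API implemented by
// blocking on the async path ("sync-over-async").
public static class SyncOverAsyncSketch
{
    public static ReadResult Read(PipeReader reader) =>
        reader.ReadAsync().AsTask().GetAwaiter().GetResult();
}
```

When the pipe already holds buffered data this completes inline; when it doesn't, the calling thread blocks until a continuation runs elsewhere, which is where the starvation concern comes from.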

/cc @davidfowl @stephentoub @divega @ajcvickers

(Note: I do think there are some justifications for keeping sync I/O APIs alongside async ones, but I tried arguing for that before and didn't get very far - the general mood seemed to be that sync is dead and gone. If anyone is interested in having this conversation again let me know.)

mgravell commented 6 years ago

Just as a "5" - and to note: SE.Redis has very similar sync+async API concerns; the way I'm addressing it is to use a form of "sync over async" (it isn't just .Wait() or .Result, but it isn't far off and those would work) - and I just make sure I strictly use IO threads in the reader code that are never exposed to the outside, i.e. they can't be the ones calling in. I do this using a custom PipeScheduler that is a dedicated thread-pool - although in most cases the global thread-pool would also be fine (I'm just cautious over thread-pool starvation being a real issue that we've seen many times, and as a data lib: things go very bad very quick if you depend on the thread-pool).

I don't know if these are good options, but they're the options I'm using today :)
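[Editor's note] The approach described above can be sketched minimally (the class name and queueing details are mine, not SE.Redis's actual code): a PipeScheduler backed by dedicated threads, so pipe continuations never depend on the shared .NET thread pool.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO.Pipelines;
using System.Threading;

// A minimal dedicated-thread PipeScheduler: continuations are queued and
// run on private background threads instead of the global thread pool.
public sealed class DedicatedThreadPipeScheduler : PipeScheduler
{
    private readonly BlockingCollection<(Action<object?> Action, object? State)> _queue = new();

    public DedicatedThreadPipeScheduler(int threadCount = 2)
    {
        for (int i = 0; i < threadCount; i++)
        {
            var thread = new Thread(() =>
            {
                // Each worker drains the queue and runs pipe continuations.
                foreach (var (action, state) in _queue.GetConsumingEnumerable())
                    action(state);
            })
            { IsBackground = true, Name = $"PipeIO-{i}" };
            thread.Start();
        }
    }

    public override void Schedule(Action<object?> action, object? state) =>
        _queue.Add((action, state));
}
```

It would be wired in via `new PipeOptions(readerScheduler: scheduler, writerScheduler: scheduler)`.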

stephentoub commented 6 years ago

If you care about getting the best throughput for the sync APIs (I don't know if you do), then you'll want to avoid any of the sync-over-async options.

roji commented 6 years ago

I'm not sure whether I should care about sync throughput, that's part of what I'm hoping to get out of this thread... If the general position is that sync I/O really is legacy/obsolete, then I guess I only need to make it work (rather than perform particularly well).

Besides, what are my alternatives, assuming I want the benefits of pipelines?

davidfowl commented 6 years ago

I don't think we should add sync overloads to pipelines. I appreciate there are cases where sync IO is better than async IO (async has more overhead in general), but in most cases we recommend that people do async IO instead. I appreciate that there are challenges adapting to older APIs that do have sync code paths, but I'm not convinced we should complicate the API as a result.

Kestrel implements a Stream over the PipeReader and PipeWriter, and for the sync overloads we do sync over async, mostly because we don't care about the performance of sync IO. Also, because libuv only has an async API, it wasn't even possible to implement sync overloads.

There are other more advanced options for using the API synchronously but it involves using the PipeReader as an abstraction over something that is synchronous. ReadAsync would block inline and basically always return a completed ValueTask with the result.
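[Editor's note] The Kestrel approach described above can be pictured with a cut-down sketch (illustrative only, not Kestrel's actual code): a read-only Stream over a PipeReader, where the sync Read overload is just sync-over-async on top of the async one.

```csharp
using System;
using System.Buffers;
using System.IO;
using System.IO.Pipelines;
using System.Threading;
using System.Threading.Tasks;

// A read-only Stream adapter over a PipeReader. Read() blocks on ReadAsync().
public sealed class PipeReaderStream : Stream
{
    private readonly PipeReader _reader;
    public PipeReaderStream(PipeReader reader) => _reader = reader;

    // Sync-over-async: block the calling thread on the async path.
    public override int Read(byte[] buffer, int offset, int count) =>
        ReadAsync(buffer, offset, count, CancellationToken.None).GetAwaiter().GetResult();

    public override async Task<int> ReadAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken)
    {
        ReadResult result = await _reader.ReadAsync(cancellationToken);
        // Copy as much as fits into the caller's buffer; consume only that much.
        var slice = result.Buffer.Length > count ? result.Buffer.Slice(0, count) : result.Buffer;
        slice.CopyTo(buffer.AsSpan(offset));
        _reader.AdvanceTo(slice.End);
        return (int)slice.Length;
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position { get => throw new NotSupportedException(); set => throw new NotSupportedException(); }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}
```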

roji commented 6 years ago

FWIW if it's somehow possible to have a version where the I/O methods would be guaranteed to return synchronously, that would amount to adding a sync API (and so it would solve my issue). But I'm not nearly familiar enough with pipelines to know what that would entail.

How do you avoid the deadlocks associated with sync-over-async in Kestrel? Are you somehow ensuring that continuations don't run on the .NET TP like @mgravell, via a PipeScheduler with its own pool (if I understood him correctly)? As of now that seems to be the best option here.

davidfowl commented 6 years ago

FWIW if it's somehow possible to have a version where the I/O methods would be guaranteed to return synchronously, that would amount to adding a sync API (and so it would solve my issue). But I'm not nearly familiar enough with pipelines to know what that would entail.

Pipelines is an abstraction and it has a default implementation. There's no built in IO but the patterns are optimized for IO. If this is a real issue in practice then I would expose a synchronous PipeReader/PipeWriter implementation over synchronous IO APIs with the existing surface area. I wouldn't want to do this as a first step though but it's do-able as a last resort.

How do you avoid the deadlocks associated with sync-over-async in Kestrel?

There are no deadlocks, you may be referring to the typical SynchronizationContext deadlock that occurs in .NET Framework because of particular implementations (like Winforms, WPF and ASP.NET non core). ASP.NET Core doesn't have a sync context so there's no deadlock but you can just starve the thread pool more easily if you end up using sync IO. We don't care because you should be using async IO on the server. It can also be mitigated by using async IO to buffer the request data if you have to use sync APIs. We do this today in MVC e.g. https://github.com/aspnet/Mvc/blob/4f1f97b5d524b344c34a25a7031691626d50ec68/src/Microsoft.AspNetCore.Mvc.Formatters.Json/JsonInputFormatter.cs#L243-L249.
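[Editor's note] The buffering mitigation mentioned above reduces to this pattern (names are illustrative): drain the source with async I/O first, so a downstream sync reader only ever touches an in-memory buffer and never blocks on the network.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class RequestBuffering
{
    // Asynchronously buffer the whole payload, then hand back a stream
    // that is safe to read synchronously (no network I/O remains).
    public static async Task<Stream> BufferAsync(Stream source)
    {
        var buffered = new MemoryStream();
        await source.CopyToAsync(buffered); // async while real I/O happens
        buffered.Position = 0;
        return buffered;
    }
}
```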

Are you somehow ensuring that continuations don't run on the .NET TP like @mgravell, via a PipeScheduler with its own pool (if I understood him correctly)?

No, they run on the thread pool. You can use your own scheduler to determine where continuations run if you don't want to starve the thread pool for blocking IO. It still won't be as performant as using sync IO all the way down but I don't know how big of a concern that is here.

roji commented 6 years ago

How do you avoid the deadlocks associated with sync-over-async in Kestrel?

There are no deadlocks, you may be referring to the typical SynchronizationContext deadlock that occurs in .NET Framework because of particular implementations (like Winforms, WPF and ASP.NET non core). ASP.NET Core doesn't have a sync context so there's no deadlock but you can just starve the thread pool more easily if you end up using sync IO. We don't care because you should be using async IO on the server. It can also be mitigated by using async IO to buffer the request data if you have to use sync APIs. We do this today in MVC e.g. https://github.com/aspnet/Mvc/blob/4f1f97b5d524b344c34a25a7031691626d50ec68/src/Microsoft.AspNetCore.Mvc.Formatters.Json/JsonInputFormatter.cs#L243-L249.

I'm thinking of another deadlock that can occur with no SynchronizationContext. If the application uses TP threads which invoke sync-over-async API calls, then if there are enough of these the thread pool gets exhausted. At that point you have TP threads synchronously blocking on async operations that require a continuation to run on a TP thread, but none are available. I've encountered this a few times in the history of Npgsql, and it makes me very wary of sync-over-async. I'm interested in whether you guys are mitigating this in any way (or maybe you don't have to for some reason).

Are you somehow ensuring that continuations don't run on the .NET TP like @mgravell, via a PipeScheduler with its own pool (if I understood him correctly)?

No, they run on the thread pool. You can use your own scheduler to determine where continuations run if you don't want to starve the thread pool for blocking IO. It still won't be as performant as using sync IO all the way down but I don't know how big of a concern that is here.

Yes, that's @mgravell's solution, involving a private thread pool which removes the dependency on the .NET TP. It does seem like the most promising option. I'm not necessarily looking to provide great performance, but rather not to break (or deadlock) existing applications written against the sync ADO.NET APIs, so this could work. I'm still interested in how you avoid the deadlocking in Kestrel, though.

davidfowl commented 6 years ago

I told you we don’t. It’s atypical to use sync IO in web scenarios so we don’t protect users from starving the thread pool if they use blocking API for read/write. We have a couple of other mitigating factors:

Barring that, we don’t guard against it.

mgravell commented 6 years ago

Just to confirm, yes @roji that's exactly the reason I try hard to avoid the TP for this. When there are spare threads, it is fine - but when it jams, if you are the data lib that some of those workers are blocking against: it jams hard, and even though it isn't a true deadlock, it might as well be - new work will come in much faster than the TP grows. TP exhaustion has hurt us hard in the past, as a library, so we try hard not to take a dependency on it for the IO work.

roji commented 6 years ago

@davidfowl just curious, are you aware of users running into deadlocks, have people complained? I suspect that Kestrel is in a different situation than an ADO.NET driver, which is potentially used in a much wider variety of scenarios - I know a lot of my users use sync APIs, and at one point when Npgsql had this kind of deadlock it caused a lot of anguish for people.

Thanks for confirming @mgravell, the situation and the solution seem pretty clear - I'll look at what you've done and see about integrating it into Npgsql.

mgravell commented 6 years ago

@roji I won't stop you, but I'm 110% open to better solutions

roji commented 6 years ago

@mgravell what would you suggest? Should we take a stab at a sync implementation of the pipelines abstraction, where methods are guaranteed to always return synchronously? Any idea how much work that would be?

mgravell commented 6 years ago

I guess if you were desperate for a pure sync API, you could play both ends of the pipe from a single thread, but doing that while fully overlapping read IO (to avoid pauses when you run out of input data) would be harder. Frankly I'm not sure I see that it is worth bending the usage out of shape - I'm guessing you allow both sync and async on the same connection (at different times), so it would be very hard to switch between them at whim. I think I'd just go with something that is approximately sync over async, like I have :)

roji commented 6 years ago

I'll have to become more familiar with pipelines to understand exactly what the different options are, but @davidfowl seemed to hint that, at least in theory, we could replace the default PipeReader and PipeWriter with sync-only implementations, which seems like an alternative to what you're proposing...

And yes, there's nothing stopping ADO.NET users from executing some things in sync and others in async on the same connection...

mgravell commented 6 years ago

that basically means re-implementing Pipe from the ground up; which you can certainly do, and there are lots of pieces you could re-use, but it isn't a small ask. However, even with zero changes there are actually only two important APIs here, and one of them (FlushAsync) you can make "always sync" simply by making the pause threshold non-positive on the PipeOptions. The other already has a sync option: TryRead. The combination of those two is what I meant by working both ends of the Pipe yourself; I guess I just don't see the point of doing that, though.
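[Editor's note] The two knobs mentioned above can be demonstrated on a plain Pipe: a pauseWriterThreshold of zero means FlushAsync never applies back-pressure (so it completes synchronously), and TryRead is the existing synchronous read.

```csharp
using System;
using System.IO.Pipelines;

// pauseWriterThreshold: 0 prevents FlushAsync from ever blocking.
var pipe = new Pipe(new PipeOptions(pauseWriterThreshold: 0));

// Write three bytes into the pipe.
var span = pipe.Writer.GetSpan(3);
span[0] = 1; span[1] = 2; span[2] = 3;
pipe.Writer.Advance(3);

var flush = pipe.Writer.FlushAsync();
Console.WriteLine(flush.IsCompleted);           // True: no pause threshold, completes inline

if (pipe.Reader.TryRead(out ReadResult result)) // synchronous read of buffered data
{
    Console.WriteLine(result.Buffer.Length);    // 3
    pipe.Reader.AdvanceTo(result.Buffer.End);
}
```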

roji commented 6 years ago

That's interesting...

But while TryRead() is synchronous, it's also non-blocking, isn't it? How would we go about making the thread block synchronously while waiting for data to arrive?

(I realize this discussion is a bit academic as we'll probably go down your sync-over-async route anyway, but it's still interesting)

mgravell commented 6 years ago

that's what I'm saying... it gets messy; you could basically do:

```csharp
while (true)
{
    if (TryRead(...))
    {
        if (TryProcessBuffer(...))
        {
            AdvanceReader(whatWeUsed);
            return result;
        }
        AdvanceReader(start, end);
    }

    GetWriteBufferFromWriter();
    SyncReadFromNetworkIntoWriteBuffer();
    AdvanceWriter();
}
```

one downside is that it doesn't make it easy to read from the network and process data concurrently. Also it is very much not how Pipe was intended to be used...

roji commented 6 years ago

Ohh, I get it - you're suggesting we totally bypass the pipelines API for actual I/O reads. So as long as data is readily (i.e. synchronously) available TryRead() is used, and the moment it fails we drop down to blocking directly on the stream.

Yeah, I don't think we want to go down this kind of route, but I'll keep it in mind as I work on Npgsql (in any case there's lots to do before hitting the problem of sync I/O).

However, regarding reading from the network and processing data concurrently, that seems like a very clear (and understandable) result of wanting to be purely sync. If we really wanted to, we could offload data processing to another thread (or task) - this is very similar to how things are when doing sync I/O on streams, isn't it?

bgrainger commented 6 years ago

If the application uses TP threads which invoke sync-over-async API calls, then if there are enough of these the thread pool gets exhausted.

I have also experienced this exact situation with MySqlConnector.

We don't care because you should be using async IO on the server.

Unfortunately, there is a lot of legacy code that does sync I/O, particularly (with MySQL) because Oracle's MySQL connector never implemented async I/O (so there was no benefit to using it).

I haven't followed everything that's been discussed on this thread yet, but I'm interested in helping figure out the “right” solution here (or just reusing yours if you get to it first 😀).

mgravell commented 6 years ago

Probably an unrelated topic, but one thing I am hearing loud and clear: lots of people involved in data libraries (things like ADO.NET or redis) still have huge anxiety attacks about thread-pool exhaustion scenarios - a consequence of being "the thing being called by a ton of simultaneous load". Happy to raise it as a different (but tangentially related) issue, but that is something I'd love to get an expert opinion from @stephentoub on, especially whether it is something that either the framework or an "official" MS lib could offer help with, rather than things like my clunky badly written custom thread-pool.

roji commented 6 years ago

@mgravell I've seen the same anxiety around TP exhaustion scenarios and deadlocks (and have been bitten myself) - people have applications that work fine in testing but lock up under load. I'm not sure what a library could do to help with that but it's definitely worth discussing.

The basic advice I give is to generally avoid blocking TP threads by using async wherever possible, and especially to avoid blocking where the operation's completion may depend on a TP thread, i.e. the sync-over-async exhaustion deadlock issue. This is why it's imperative for a library to provide sync operation implementations that don't trigger this issue, because you never know who calls you and from where.

davidfowl commented 6 years ago

This is why it's imperative for a library to provide sync operation implementations that don't trigger this issue, because you never know who calls you and from where.

It makes it less likely because you'll be using one thread instead of two per operation, but blocking is blocking. For example, ASP.NET runs your code on a thread pool thread and you'll be blocking it. I'm not saying that's an excuse to do sync over async; it's just why we recommend (in ASP.NET scenarios) that people use async APIs.

ajcvickers commented 6 years ago

Just a couple of comments from a non-web perspective:

  * Any sync-over-async must be safe from deadlocks for non web scenarios running in multiple different platforms. For example, the fact that ASP.NET core doesn't have a sync context is largely irrelevant because many times the code won't be running in ASP.NET Core, or even .NET Core for that matter.
  * Sync I/O is still used commonly outside of web scenarios. Also, some database providers, such as SQLite don't do async at all.
  * Right now, sync I/O is usually faster than async for simple scenarios. It would be a shame if it was not only not faster than async, but suddenly became considerably slower.

So we're going to keep having sync methods in EF, for example, and people will keep using them. They should work, without the potential for deadlock, and ideally without regressing current perf.

roji commented 6 years ago

Responding to @ajcvickers (and in general), Npgsql is a good example of a library where all the stream handling/buffering has been done manually. The system does work, which is why I'm not really looking at adopting pipelines for a speed boost. However, as @mgravell mentioned in his post, it involves lots of potentially brittle code which I would love to get rid of. In other words, for me pipelines is about refactoring Npgsql and making it much simpler/nicer, rather than an optimization.

As such it's true that it would be a shame to regress sync I/O perf, but given the discussion above a sync-over-async API seems like the only way forward if I want to experiment with pipelines. I'm still interested enough to go ahead with the experiment, measure the regression, and decide whether its worth the gain in codebase quality.

But if another solution comes up that allows using pipelines with a sync API, without deadlocks and without a perf regression, that would obviously be really great.

Tornhoof commented 6 years ago

My two cents, as I ran into them just recently:

  * Sync DB Apis are still used in seeding databases, e.g. for test setup
  * Having a sync Pipeline API might help the json efforts in corefxlab, as there is no async around byref types

davidfowl commented 6 years ago

Sync DB Apis are still used in seeding databases, e.g. for test setup

Sure, it's hard to starve the thread pool here so sync over async is fine.

Having a sync Pipeline API might help the json efforts in corefxlab, as there is no async around byref types.

Not quite. The plan is to have a higher level API to do async reading. The lower level parsing APIs are synchronous but there will still need to be async reading to get more data.

Tornhoof commented 6 years ago

Not quite. The plan is to have a higher level API to do async reading [...]

Even better, I'm looking forward to see how you'll design that.

davidfowl commented 6 years ago

@ajcvickers

Any sync-over-async must be safe from deadlocks for non web scenarios running in multiple different platforms. For example, the fact that ASP.NET core doesn't have a sync context is largely irrelevant because many times the code won't be running in ASP.NET Core, or even .NET Core for that matter.

Is it common to run out of thread pool threads in client applications?

Sync I/O is still used commonly outside of web scenarios. Also, some database providers, such as SQLite don't do async at all.

That's a good point. SQLite is just a file, it isn't streaming data back and forth from the network.

Right now, sync I/O is usually faster than async for simple scenarios. It would be a shame if it was not only not faster than async, but suddenly became considerably slower.

We should get some benchmarks ready so we can understand which scenarios matter. I'll admit, I don't have a good sense for how ADO.NET is used in non-server environments where thread pool starvation is a concern. Any extra data/scenarios here would be great.

@roji

As such it's true that it would be a shame to regress sync I/O perf, but given the discussion above a sync-over-async API seems like the only way forward if I want to experiment with pipelines. I'm still interested enough to go ahead with the experiment, measure the regression, and decide whether its worth the gain in codebase quality.

👍 . I'm interested in this as well. Given that you could potentially be reading already buffered data most of the time because of read ahead, it may not be as bad as we think in the usual case. That plus a combination of your own threads to avoid thread pool starvation (which seems to be recurring theme here) may be good enough and might not cause a regression.

But if another solution comes up that allows using pipelines with a sync API, without deadlocks and without a perf regression, that would obviously be really great.

This is what I was talking about earlier. Here's an incomplete PipeReader implementation over a FileStream doing synchronous IO.

https://github.com/davidfowl/StringsAreEvil/blob/301d68ab85d17fbf2d3a4f0d01f33b04fac36ec5/StringsAreEvil/Program.cs#L351

It's one of the core reasons PipeReader and PipeWriter are indeed abstract classes that exist outside of the single Pipe implementation. I'm not convinced that this should be the first option but it's an option if need be.
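[Editor's note] That idea can be sketched in miniature (hypothetical and heavily simplified; a real implementation would pool buffers and track examined-but-unconsumed data properly): a PipeReader whose ReadAsync does blocking Stream reads inline and always returns a completed ValueTask.

```csharp
using System;
using System.Buffers;
using System.IO;
using System.IO.Pipelines;
using System.Threading;
using System.Threading.Tasks;

public sealed class SyncStreamPipeReader : PipeReader
{
    private readonly Stream _stream;
    private readonly byte[] _buffer = new byte[4096];
    private int _start, _end; // unconsumed window into _buffer

    public SyncStreamPipeReader(Stream stream) => _stream = stream;

    public override ValueTask<ReadResult> ReadAsync(CancellationToken cancellationToken = default)
    {
        if (_start == _end)
        {
            // Blocking read, inline: the returned ValueTask is always completed.
            _start = 0;
            _end = _stream.Read(_buffer, 0, _buffer.Length);
        }
        var buffer = new ReadOnlySequence<byte>(_buffer, _start, _end - _start);
        return new ValueTask<ReadResult>(new ReadResult(buffer, isCanceled: false, isCompleted: _end == 0));
    }

    public override bool TryRead(out ReadResult result)
    {
        if (_start == _end) { result = default; return false; }
        result = new ReadResult(new ReadOnlySequence<byte>(_buffer, _start, _end - _start), false, false);
        return true;
    }

    // Positions come from a single array-backed segment, so GetInteger()
    // is the absolute index into _buffer.
    public override void AdvanceTo(SequencePosition consumed) => _start = consumed.GetInteger();
    public override void AdvanceTo(SequencePosition consumed, SequencePosition examined) => _start = consumed.GetInteger();

    public override void CancelPendingRead() { } // nothing is ever actually pending
    public override void Complete(Exception? exception = null) => _stream.Dispose();
}
```

The point is exactly what davidfowl describes: callers keep using the PipeReader surface (ReadAsync/AdvanceTo), but every operation completes synchronously underneath.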

roji commented 6 years ago

@davidfowl,

Any sync-over-async must be safe from deadlocks for non web scenarios running in multiple different platforms. For example, the fact that ASP.NET core doesn't have a sync context is largely irrelevant because many times the code won't be running in ASP.NET Core, or even .NET Core for that matter.

Is it common to run out of thread pool threads in client applications?

From my side I can only say users report all scenarios :) In the TP, out of the TP... Sometimes it's just a console program doing ADO.NET from the main application thread, sometimes it's Orleans with its own task scheduler; it's really very diverse once you step out of ASP.NET.

And just to be sure everyone's on the same page, when we're talking about deadlocks we're discussing two distinct (but somewhat similar) things:

  * Deadlock resulting from SynchronizationContext (so mainly GUI programs, not ASP.NET), where the UI thread blocks synchronously on an async operation that will never complete.
  * Deadlock-like scenario resulting from TP thread exhaustion, where all TP threads block synchronously on async operations that require TP threads to complete.

As such it's true that it would be a shame to regress sync I/O perf, but given the discussion above a sync-over-async API seems like the only way forward if I want to experiment with pipelines. I'm still interested enough to go ahead with the experiment, measure the regression, and decide whether its worth the gain in codebase quality.

+1 . I'm interested in this as well. Given that you could potentially be reading already buffered data most of the time because of read ahead, it may not be as bad as we think in the usual case. That plus a combination of your own threads to avoid thread pool starvation (which seems to be recurring theme here) may be good enough and might not cause a regression.

It's true that read buffering naturally reduces the impact of any perf regression on actual I/O (because you're making fewer actual I/O calls). However, we've seen from previous benchmarks that the actual async machinery of resuming a continuation on another thread (doesn't matter where) does have a cost when compared to a simple sync call (I think @ajcvickers is saying the same thing). So unless we have a totally sync pipe implementation (like the incomplete FilePipeReader you posted), we're going to take a perf hit; the only question is how much.

In any case, I'll go ahead and see about redoing Npgsql over pipelines, and see how everything works asynchronously - we can always tackle sync a bit later. The nice thing is that it's an abstraction, so if we decide to go for it we can simply swap the pipeline implementation with a fully-sync one (alongside the regular one for async) at the end of the process. I'll try to report on it soon.

davidfowl commented 6 years ago

Deadlock resulting from SynchronizationContext (so mainly GUI programs, not ASP.NET), where the UI thread blocks synchronously on an async operation that will never complete.

That one is trivial to avoid.

Deadlock-like scenario resulting from TP thread exhaustion, where all TP threads block synchronously on async operations that require TP threads to complete.

This one requires a couple of things:

From my side I can only say users report all scenarios :) In the TP, out of the TP... Sometimes it's just a console program doing ADO.NET from the main application thread, sometimes its Orleans with its own task scheduler, it's really very diverse once you step out of ASP.NET.

Sure.

It's true that read buffering naturally reduces the impact of any perf regression on actual I/O (because you're making fewer actual I/O calls). However, we've seen from previous benchmarks that the actual async machinery of resuming a continuation on another thread (doesn't matter where) does have a cost when compared to a simple sync call (I think @ajcvickers is saying the same thing).

Absolutely, and there are ways to avoid that, like using TryRead to avoid the async state machine in what would hopefully end up being the majority of cases (though that's harder to predict).
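[Editor's note] The TryRead fast path alluded to here looks roughly like this (the method and its "message length" payload are hypothetical stand-ins for real protocol parsing): when buffered data is already available, skip the async state machine entirely and only fall back to the awaiting slow path when the pipe is empty.

```csharp
using System.IO.Pipelines;
using System.Threading.Tasks;

public static class FastPathReader
{
    public static ValueTask<long> ReadMessageLengthAsync(PipeReader reader)
    {
        if (reader.TryRead(out ReadResult result))   // synchronous fast path
        {
            long length = result.Buffer.Length;      // stand-in for real parsing
            reader.AdvanceTo(result.Buffer.End);
            return new ValueTask<long>(length);      // no state machine allocated
        }
        return ReadSlowAsync(reader);                // async slow path

        static async ValueTask<long> ReadSlowAsync(PipeReader reader)
        {
            ReadResult result = await reader.ReadAsync();
            long length = result.Buffer.Length;
            reader.AdvanceTo(result.Buffer.End);
            return length;
        }
    }
}
```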

In any case, I'll go ahead and see about redoing Npgsql over pipelines, and see how everything works asynchronously - we can always tackle sync a bit later. The nice thing is that it's an abstraction, so if we decide to go for it we can simply swap the pipeline implementation with a fully-sync one (alongside the regular one for async) at the end of the process. I'll try to report on it soon.

Keep us posted.

roji commented 6 years ago

Closing as the different options seem to have been discussed here.