divega commented 7 years ago

We have some ideas that we want to try, which involve giving low-level networking for ADO.NET providers a similar makeover treatment to what did with Kestrel some time ago.

@davidfowl (and @anpete?) volunteered to do some experimentation with PostgreSQL, but this could apply to any database.

divega commented 7 years ago

@davidfowl, @anpete could you please detail a bit more (and better) what you had in mind here?

davidfowl commented 7 years ago

@roji I'd like to get a good idea of the threading model and buffering strategy you chose for Npgsql. That'll help with my mental model (I'd like to do some experiments with pipelines).

divega commented 7 years ago

@davidfowl are you actively working on this?

roji commented 7 years ago

@davidfowl it's great to be having this conversation with you guys. Below are some notes on Npgsql works, please don't hesitate to ask more questions.

I/O and Buffering

Npgsql tries to buffer quite aggressively (IMHO). Each physical connection has a read and a write buffer (8k each by default), and almost all Npgsql code reads and writes binary data to and from these buffers, only flushing when they're full (or when the operation has been completely written/read). Take a look at NpgsqlReadBuffer and NpgsqlWriteBuffer, which contain methods such as {Read,Write}Int32().
Actual I/O only takes place when necessary. In other words, once my query has been sent to PostgreSQL, a single NetworkStream.Read{,Async}() call takes place for the entire read buffer size (8k). At this point it's probable that many result rows are read into the buffer, and subsequent calls to NpgsqlDataReader.Read() will be memory-only - until the read buffer is exhausted and another I/O call occurs.
The minimum unit for buffering is the row. As long as rows are relatively small, they get read in one I/O call and are later processed from memory. However, big rows (>8k) will cause the read buffer to grow, causing allocations. To avoid this, you can either increase the buffer size (connection string parameter), or use CommandBehavior.Sequential when executing the command - this will switch to a more advanced mode where we don't attempt to buffer the entire row in memory (or anything else for that matter), and it's the user's responsibility to read the row's columns in the order in which they arrive from the database (this is a standard ADO.NET API feature). This really only makes sense for scenarios with very big rows (e.g. binary data).
Importantly, Npgsql performs reading and writing concurrently, not waiting for the entire operation to write before we start to read the results. This is first of all necessary to avoid an important deadlock: when sending a multi-statement command, PostgreSQL will stop reading from its connection once it sees the protocol message for executing the first statement - and will start writing back its results. We therefore can't attempt to write all the statements' messages before transferring control back to the user and starting to read the first statement's results. This also has the nice side effect of optimizing I/O, since writing and reading happens in parallel (although this is only relevant for multi-statement queries with large parameter and result payloads).

Async and Sync I/O

When implementing async support for Npgsql, I made the decision to keep providing both a true sync and true async API. This was first and foremost because ADO.NET provides both options, and because in some scenarios truly sync database access may be more appropriate (simpler programming model, possibly more efficient implementation depending on O/S, etc.).
The implementation went through several iterations, ending up in a single codepath of async methods inside Npgsql, which all pass an async flag down the stack. At the very bottom, when actual I/O has to occur, we test on the flag and execute either Read() or ReadAsync(). This allows us to avoid duplicating code (i.e. same functionality written once as a regular method and once as an async method) without any significant overhead (I hope!).
Note that I did try to minimize calling async methods as much as possible to avoid the added overhead. ValueTask is used in all places where it's not 100% sure that I/O will occur (i.e. thanks to in-the memory buffering described above).

Synchronization

The only major point where synchronization is currently necessary is the connection pool, which is invoked when NpgsqlConnection.Open() and NpgsqlConnection.Close() are called and which is responsible for assigning a physical connection (what Npgsql calls a "connector") to the user's NpgsqlConnection.
Apart from the connection pool, Npgsql code contains almost no synchronization. Here's a list where it does, although I'm not sure how important this is to this discussion:
- A SemaphoreSlim is taken as long as any user action (e.g. query) is in progress. This is to produce meaningful "already in use" errors in case the user wrongfully uses the same connection concurrently, and also to synchronize with Npgsql's internal keepalive mechanism. Note that the SemaphoreSlim itself is protected by a regular lock, to prevent race conditions with disposing the semaphore.
- There's also a lock to make sure we wait until command cancellation completes before sending a new command (cancellation in PostgreSQL is a completely "asynchronous" affair that takes place on a different physical socket).

I hope this is a first fair description of Npgsql's internals, whipped up just before I head off for work :) Please feel free to ask more questions, I'll be closely monitoring this thread and the other performance-related ones. I'd definitely welcome experimentation with a different I/O model (e.g. pipelines).

davidfowl commented 7 years ago

@davidfowl are you actively working on this?

No, it's on my (now shorter) list of things to look at. I'm looking at using minimal code to build a simple postgres driver based on a protocol spec that @mikeharder shared a while back. Basically a single query with results to benchmark some of the patterns.

@roji Is something like this interesting for npgsql?

roji commented 7 years ago

@davidfowl do you mean building a non-ADO.NET driver for PostgreSQL, possibly non-managed (i.e. based on libpq)? Or just an attempt to build things properly from the ground up to see where it leads?

Npgsql is obviously a fully managed ADO.NET driver, but of course anything you do in terms of PostgreSQL connectivity interests me - you're also welcome to ask me questions, I've had a lot of experience working with the wire protocol. Also, at some point in the email exchanges I mentioned the possibility of splitting Npgsql (or any other ADO.NET driver) into two parts: a low-level ultra-efficient API, on top of which we'd built an ADO.NET adapter. This would allow performance-hungry users to drop down to the low-level API as needed. The low-level API could simply be libpq, or it could be fully-managed implementation (possibly libq first and managed later). There's also the question of whether this low-level component would expose some sort of standardized new database API - a high-performance competitor to ADO.NET, which, among other things, would allow you to include database sockets in an epoll/kqueue programming model (currently impossible with ADO.NET).

All this is a bit theoretical, and this would obviously mean very big changes for Npgsql (it would definitely be Npgsql 4.0 :)). Let me know if this is the direction you were thinking of etc.

davidfowl commented 7 years ago

Or just an attempt to build things properly from the ground up to see where it leads?

This. I wasn't thinking about going as far as building an entire driver. The end goal would be to improve npgsql. The idea is to write a minimal low level API that does the working and protocol so we can focus on things like buffering, IO, async and threading models. Ideally we'd take a very simple query and write the simplest code possible then build up from there. I'd also like to see this example using the lower layer of npgsql (if that's possible).

roji commented 7 years ago

So yeah, I'm absolutely interested... Feel free to reach out if any questions arise during implementation. Note that at the moment Npgsql doesn't really have a low-level layer internally which can be easily separated from an higher-level ADO.NET layer - that was more an idea for the future.

davidfowl commented 7 years ago

@roji, I'm going to try to get something basic working within the next few days. I'll push the code to this repository (in a PR).

roji commented 7 years ago

Great, once you push something I'll definitely take a look - it'll be interesting to compare with Npgsql.

On my side I'm working on bringing the dev branch back to working order (there are some pending issues) and then the plan is to implement command-level caching (https://github.com/npgsql/npgsql/issues/1701), which will provide quite a substantial boost.

sebastienros commented 7 years ago

Here are the result based on @anpete 's implementation of the protocol for the fortunes query.

Same environment as TE on cloud (Azure D3V2 machines). Database server scaled up to 8 cores as it was a limiting factor for the max perf. The new driver is compared to npgsql in async mode with full cached strategy (see https://github.com/aspnet/DataAccessPerformance/issues/3 for explanation).

This configuration we actually max out the client CPU, the client Network and the Database CPU, changing any of these would not change the overall result significantly.

What we can learn from that is that we reached pgbench baseline numbers with a fully managed .NET Core implementation.

We also tried this driver in the TE Fortunes benchmark. We are seeing 45K rps, compared to 16K with in raw ADO.NET mode with npgsql. On this infrastructure the current best result (undertow framework) is at 30K rps.

Next step is to decide what to do with that data, and more specifically if we want to go further with the idea of a different abstraction layer than ADO.NET, or even no abstractions at all but custom providers for each micro/full ORM.

roji commented 7 years ago

@anpete and @sebastienros, it's really great to see Peregrine and its associated benchmarks. It indeed looks like a solid baseline of where we might end up with a managed implementation, and I think it shows that looking at a native solution (e.g. libpq wrapper) or a radically different I/O approach may not be worthwhile.

Here are some notes after reviewing Peregrine. Some of these will point out missing features, which are totally understandable in an early-stage attempt such as this, and the idea is mainly to get us thinking and to evaluate to what extent the high performance is a result of a lack of needed functionality.

First, don't forget that pgbench doesn't exactly do the same as what you're asking Npgsql and Peregrine to do. For one thing, I'm pretty sure the results aren't even parsed - no UTF8 decoding - which I think was a major part of the running time when @damageboy profiled Npgsql.
One big mystery is why Npgsql peaks out at 32 threads whereas Peregrine peaks out at 64 threads. @sebastienros, any chance you can provide more data on what is going on at 64 threads with Npgsql? CPU usage, network utilization? Is it a result of the GC hanging us up?
I like the idea of prepared statements being specified at the pool level, so that all connections returned from that pool have them, and they can be explicitly used by name later. However, it's difficult to see layers such as Dapper or EF leveraging this - those layers produce SQL and generally don't contain explicit APIs for knowing about or reusing server-side statements. So while this is appropriate for a lower-level access API, some automatic/implicit mechanism seems necessary, which would transparently prepare based on usage (any SQL used more than X times), similar to what Npgsql 3.2 introduced. This kind of mechanism would surely have some performance impact (e.g. hashing on the command's SQL to get its corresponding prepared statements).
While we're on the subject... @sebastienros, would it be possible for you to benchmark a special version I hacked up alongside the three above, which does command caching (https://github.com/npgsql/npgsql/issues/1701)? The code isn't fit for merging, but it should show what is possible in Npgsql. The version is available as version 3.3.0-command-caching from the Npgsql unstable feed.
The current implementation assumes that the entire messages to be written (and read) will always fit in the buffer. While this works for TechEmpower's very short statements, in real life messages are sometimes going to be larger than the buffer, or will simply be split down the middle. The wire protocol requires the length upfront, so Npgsql supports this by doing two passes: we first go over the message and calculate how many bytes will be necessary, then we start writing the message itself, possibly filling the buffer several times and flushing.
Note that the above "two-pass" system is also related to parameter validation: Npgsql makes sure parameters are OK before starting to actually send the command to the server (incorrect parameter can be un-encodable strings, out-of-range values and other values which are un-representable in PostgreSQL). Npgsql does this upfront because if a validation error occurs but part of the command has already been written, we cannot complete the operation and the protocol sync will be broken (so the connection is unusable).
For "ultra" high performance, Npgsql can implement a mode where validation is disabled (so a param validation error breaks the connection), and where if the message happens to fit in the buffer a single writing pass is done (it makes sense to optimize for this case). This will shave some time off, but I wonder how actually relevant it is for real-world scenarios (i.e. is the gain worth it). I've opened https://github.com/npgsql/npgsql/issues/1727 to look into this.
Peregrine shifts the entire burden of parsing the results to the user - no Describe message is sent, and the driver doesn't interpret any RowDescription messages. While this is OK for a baseline/proof-of-concept, this means that if a user accidentally tries to read a string as an int they'll get totally corrupt data etc.

sebastienros commented 6 years ago

@roji

As requested, here is and updated graph with your experimental version. This one is from Linux on the async mode. The improvements are impressive.

A note however is that it seems to have issues with the sync mode, with a lot of variation and nearing 0 sometimes.

aspnet / DataAccessPerformance

Idea: better low level networking code #5

I/O and Buffering

Async and Sync I/O

Synchronization