Meeting notes 5/31/2018

aspnet / DataAccessPerformance

Benchmarks, prototypes and discussions for database access performance investigation

MIT License

116 stars 26 forks source link

Meeting notes 5/31/2018 #42

Open divega opened 6 years ago

divega commented 6 years ago

Attendees: @ajcvickers, @DamianEdwards, @divega, @sebastienros

We discussed today the next steps we can focus on:

Low hanging fruit

Platform versions for EF Core and Dapper: @sebastienros to adapt @benaadams versions
Can we do better on Data Update, single and multiple query tests? @sebastienros can take a look. We wonder if @benaadams would be interested in these.
RavenDB : There is a PR, @sebastienros is tracking it

Longer term investments

Pipelining:
- @divega to invite @roji and @bgrainger to next meeting to brainstorm.
- We want to understand exactly what it is. For example, pipelining vs. multiplexing (out of order) vs. just batching queries from multiple public facing connections into the same physical connection?
- Once we understand we can find out what we could do for each database.
- Can have a mode in which we keep the same public API but with limitations (for example, no transactions)?
- @sebastienros to check if we are already doing batching everywhere we can (we can ask for batching to be revisited if it is disallowed)
SqlClient:
- @divega setting up something up to discuss future SqlClient development, performance issues, and the fact that TE wants to include SQL Server numbers.
Make our infra easy to access for the community: @sebastienros to look into this.
Mongo (@divega to find contacts, @sebastienros to enable TE
EF Core: We want to do more optimizations but no concrete plan ATM
EF6: Once on .NET Core it could be useful, if anything to compare with EF Core
ADO.NET improvements: lots of System.Data issues in Futures milestones
Features that cannot be used in benchmarks:
- Coalescing: Probably a knob once we have pipelining

benaadams commented 6 years ago

[TE] Add SQL Server (on linux) as a database type https://github.com/TechEmpower/FrameworkBenchmarks/pull/3807

bgrainger commented 6 years ago

Pipelining in MySQL

Background

The MySQL protocol is packet-oriented (details). (Note: multiple MySQL packets may be contained in a single TCP packet if they're small enough.) Each MySQL packet (up to 16MiB) has a one-byte sequence number. Client/server interactions are grouped into "conversations" where the first packet has sequence number 0 and each subsequent packet has the next sequence number (with wraparound). If a party in the conversation needs to send multiple packets, it numbers them i, i+1, ..., i+k and the other party replies (if necessary) with i+k+1.

Request/Response

Some conversations have multiple back-and-forth packets, e.g., log on:

<-- Initial handshake (0)
--> Handshake response (1)
<-- Authentication method switch (2)
--> Authentication data (3)
<-- OK (4)

Others (e.g., ping) are very simple, one packet from each side:

--> Ping (0)
<-- OK (1)

A query will result in multiple packets being sent back from the server; note that they're numbered consecutively:

--> "SELECT 1, 2;" Query (0)
<-- Column count (1)
<-- Column 1 (2)
<-- Column 2 (3)
<-- EOF (4)
<-- Row 1 (5)
<-- EOF (6)

Batching

I understand "batching" to mean putting multiple individual SQL statements into one COM_QUERY packet. The server will parse this into multiple SQL statements, then send back multiple result sets. This is easily accomplished in ADO.NET by setting DbCommand.CommandText to multiple semicolon-delimited SQL statements. This looks something like:

--> "SELECT 1; SELECT 2; SELECT 3;" Query (0)
<-- Column count (1)
<-- Column 1 (2)
<-- EOF (4)
<-- Row 1 (5)
<-- EOF + more results (6)
<-- Column count (7)
<-- Column 1 (8)
<-- EOF (9)
<-- Row 1 (10)
<-- EOF + more results (11)
<-- Column count (12)
<-- Column 1 (13)
<-- EOF (14)
<-- Row 1 (15)
<-- EOF (16)

This is done, for example, here: https://github.com/TechEmpower/FrameworkBenchmarks/blob/724a773096f403d149cfaf59b8257465bbf70103/frameworks/CSharp/aspnetcore/Benchmarks/Data/RawDb.cs#L130

Pipelining

By "pipelining" in MySQL, I mean starting multiple separate client conversations before the server response has been received. This is not an official part of the MySQL protocol and some servers (e.g., Amazon Aurora) will simply not respond to it. It looks something like:

--> Ping (0)
--> Reset Connection (0)
--> "SELECT 1;" Query (0)
<-- OK (1)
<-- OK (1)
<-- Column count (1)
<-- Column 1 (2)
<-- EOF (3)
<-- Row 1 (4)
<-- EOF (5)

It appears to work because MySQL Server reads the first packet with sequence ID 0, sends its response, then pulls the next packet off its receive buffer. It reads that as a new conversation with sequence ID 0, so it sends its response to that, and so on. One reason that it works is that very few conversations in the MySQL protocol have multiple replies from both sides; it's generally a client request followed by one or more server packets, then the conversation is implicitly over.

One core difference between batching and pipelining is that pipelining lets you send packets of different kinds; batching is just one outgoing packet with a larger payload.

The only place I've attempted to use it is in the library internals, e.g., sending one TCP packet that includes all the MySQL packets needed to send a "reset connection" request. (As hinted at earlier, this works great with MySQL Server, but not Amazon Aurora, so I reverted it and now all connection resets require multiple network roundtrips.)

This pipelining doesn't map easily on to the ADO.NET API. Conceivably one could let a client execute multiple ExecuteReaderAsync calls before consuming each reader, but because MySQL Server streams the result sets back in order, the client would have to read and dispose each reader in the order they were created. (Right now a client is simply forbidden from calling ExecuteXxx if there is still an open MySqlDataReader.)

Hope this helps; please let me know if anything's unclear.

benaadams commented 6 years ago

By "pipelining" in MySQL, I mean starting multiple separate client conversations before the server response has been received.

As the authentication is the same, could connection pooling work to some degree down the same connection for multiplexing? (e.g. use less "physical connections" than the total "active connections" for fuller TCP packets and less chatter)

--> Ping (0)
--> Reset Connection (0)
--> "SELECT 1;" Query (0)
--> "SELECT 1;" Query (0)
--> "SELECT 1;" Query (0)
--> "SELECT 1;" Query (0)
<-- OK (1)
<-- OK (1)
<-- Column count (1)
<-- Column count (1)
<-- Column count (1)
<-- Column count (1)
<-- Column 1 (2)
<-- Column 1 (2)
<-- Column 1 (2)
<-- Column 1 (2)
<-- EOF (3)
<-- EOF (3)
<-- EOF (3)
<-- EOF (3)
<-- Row 1 (4)
<-- Row 1 (4)
<-- Row 1 (4)
<-- Row 1 (4)
<-- EOF (5)
<-- EOF (5)
<-- EOF (5)
<-- EOF (5)

sebastienros commented 6 years ago

The current model that is suggested by ADO.NET is to pool actual connections and associate them with a single ADO.NET connection. What other "fast" framework do is open a specific number of database connections, and not associate them with the ADO.NET connection itself. But when a command is sent, the ADO.NET connection will look for any currently open connection, or one by affinity, and then use it. It means that two requests, getting two ADO.NET connections, could be using the same database connection, and allowing pipelining. Even without using pipelining this could still improve perf as less database connections are active, and prepared statements are shared more efficiently.

benaadams commented 6 years ago

What other "fast" framework do is open a specific number of database connections, and not associate them with the ADO.NET connection itself.

Base pooling around data stream rather than ADO connection; like?

-- Get from pool
--> "SELECT 1, 2;" Query (0)
<-- Column count (1)
<-- Column 1 (2)
<-- Column 2 (3)
<-- EOF (4)
<-- Row 1 (5)
<-- EOF (6)
-- Return to pool

Makes sense...

bgrainger commented 6 years ago

As the authentication is the same, could connection pooling work to some degree down the same connection for multiplexing?

I think this is theoretically possible, but haven't ever tested it. Presumably each MySqlDataReader would have to know its "order" in the queue (according to the sequence in which requests were sent) and if they were read "out of order" (e.g., on different threads) then it would pull responses off the socket and buffer them on the MySqlDataReader that they were for. Then you'd need to add locking around all I/O, or have a dedicated thread for each socket?

What other "fast" framework do is open a specific number of database connections, and not associate them with the ADO.NET connection itself. But when a command is sent, the ADO.NET connection will look for any currently open connection

The assumption of a dedicated connection is pretty common in client code: variables, temp tables, session settings, etc. are all associated with the current session, which is represented by a MySqlConnection. It would be a breaking change (and quite unexpected) to make a single MySqlConnection object use different physical connections behind the scenes; it could also break client code to multiplex multiple logical connections onto the same physical connection. The driver would have to detect if START TRANSACTION was ever executed, then put that connection into an "exclusive" mode that's no longer shared. It feels like there's a lot of edge cases here. (This could all be opt-in with an EnablePipelining=true connection string option, for clients who know they are using ADO.NET in a compatible way.)

sebastienros commented 6 years ago

It would be a breaking change (and quite unexpected) to make a single MySqlConnection object use different physical connections behind the scenes

I was suggesting the opposite, a single physical connection shared by multiple ado.net ones.

bgrainger commented 6 years ago

It still would cause problems if two non-pipeline-aware clients both execute CREATE TEMPORARY TABLE temp on different MySqlConnection objects that share the same physical server session.

roji commented 6 years ago

A few comments from me...

First, the batching vs. pipelining distinction @bgrainger made above isn't very important for the PostgreSQL case. [PostgreSQL does have two sub-protocols - simple and extended](). The simple protocol has a single wire message (called Query), which may contain multiple SQL statements separated by semicolons, and which PostgreSQL parses on its end (this is what @bgrainger calls "batching"). Unfortunately the simple protocol is quite crippled - it does not support binary value transfers, parameters or prepared statements.

The extended protocol, on the other hand, requires a series of messages for each SQL statement. Executing an unprepared statements involves sending 5 protocol messages (Parse/Bind/Describe/Execute/Sync), while a prepared statement involves sending 3 (Bind/Execute/Sync). It is not permitted to include multiple SQL statements separated by semicolons, so Npgsql performs rudimentary SQL parsing client-side, splitting the CommandText on semicolons and sending the above set of messages for each parsed SQL statement.

I think that the difference between the above two types of batching is rather uninteresting - it's an internal protocol/implementation detail that doesn't affect much, and I tend to think of both of the above as "batching". Note that PostgreSQL messages do not have any sort of ID or running number, so we can send as many messages as we want up-front; PostgreSQL just reads and executes messages as they come in, and the client's responsibility is to keep track of ordering and parse the results accordingly.

I tend to think of pipelining as the ability of a single client (important!) to send a 2nd request before the 1st one has completed; in ADO.NET terms this would mean invoking DbCommand.Execute*() while another command is still running. This isn't currently supported, and AFAIK there's no TechEmpower test which potentially allows this to be exercised (but I may be mistaken).

Finally there's multiplexing, which to me describes the ability to combine queries from multiple clients into one or more physical connections - we're talking about something like a multiple-producer/multiple-consumer model.

As @bgrainger correctly pointed out, there's an ordering issue: since clients may attempt to consume query results in any order, it would require buffering entire resultsets in memory, killing the perf benefits of streaming (which I think can be quite important in many scenarios).
Transactions are obviously out of the question. In theory we could "degrade gracefully": the moment BeginTransaction() is invoked on a DbConnection, a physical connection is bound/affined to its producer in the same way that it is today. This could provide the best of both worlds.
As @bgrainger mentions above, physical connections are stateful objects (beyond transactions), so there can be other issues (temporary tables, various session parameters and settings...). FWIW, pgbouncer is a popular out-of-process connection pool which deals exactly with this kind of stuff: you can do session, transaction or statement pooling. I think that few usage scenarios out there actually make use of session state in this way, but who knows.
I think there's nothing preventing a multiplexing driver from implementing the ADO.NET interface (even if it wasn't the original intent). Nothing actually requires a DbConnection to correspond to a physical connection: Open() be a no-op, and the Execute*() methods would dispatch to one of the available physical connections (ideally the least loaded one). They would then await some TaskCompletionSource until the results are received etc.

And now for some perf analysis:

It's true that in a saturated scenario (all connections busy), multiplexing has its benefit. In the traditional implementation DbConnection.Open() blocks until a physical connection frees up. With multiplexing we can write our query to the server, at least until the TCP window is exhausted.
On the other hand, as long as we're not saturated (some physical connections are available), this may be a degradation. In the traditional model applications are responsible for keeping the connection open for as short a time as possible; as long as they don't hold open connections longer than they have to, multiplexing shouldn't have any advantage (unless I'm missing something) but only increased overhead (see below).
The plumbing for implementing multiplexing probably has some cost. The "synchronization point" is moved from DbConnection.Open() (now implemented in Npgsql by a lock-free pool) to DbCommand.Execute*(). For cases where multiple commands are being executed in series, that means we do synchronization more often when multiplexing. In other words, there may be an advantage to allocating your connection for multiple commands rather than allocating it again and again for each command.
I also suspect that the traditional model may simply have lower overhead, since Execute*() simply and directly awaits on the socket (rather than to a TaskCompletionSource behind plumbing etc.)

I understand that one (some?) of the top TechEmpower performers apparently multiplexes (have not actually looked into this), but do we have any justification for thinking that this is what makes it faster?

bgrainger commented 6 years ago

Unfortunately the simple protocol is quite crippled - it does not support binary value transfers, parameters or prepared statements.

Oh, that reminds me... MySQL also supports a binary protocol, which you have to use if you want to use prepared statements (which are not currently supported in MySqlConnector). Like PG, it doesn't support multiple statements, so MySqlConnector would have to parse the SQL and break them up. It has a three-phase Bind/Execute/Reset model.

Implementing prepared statements would provide an estimated 15% improvement for aspcore-*-my on the TFB benchmarks, so it's definitely a higher priority for me than hypothetical pipelining improvements.

roji commented 6 years ago

Huh, interesting that the protocols are so similar in that way... I guess both started out from a simple text-based protocol and later added the advanced features as a new sub-protocol...

Implementing preparation definitely sounds like a high priority perf gain, you're welcome to help yourself to Npgsql's query parser if it can help you out.

benaadams commented 6 years ago

It is not permitted to include multiple SQL statements separated by semicolons, so Npgsql performs rudimentary SQL parsing client-side, splitting the CommandText on semicolons and sending the above set of messages for each parsed SQL statement.

Interesting... as the final command in data updates is 20 update statements separated by semicolons. What would the best approach be to send this as a batch of commands? (From user api side)

benaadams commented 6 years ago

In ADO.NET terms this would mean invoking DbCommand.Execute*() while another command is still running. This isn't currently supported, and AFAIK there's no TechEmpower test which potentially allows this to be exercised (but I may be mistaken).

Querys (20) would be such a test; also data updates (as that is queries + updates); and spawning them on multiple tasks. I tried it with Mono, the only safe way I saw at the time was to use multiple connections for the concurrency - however multiplying up the connection count made it very unhappy so I had to back it out https://github.com/TechEmpower/FrameworkBenchmarks/pull/3784/files

roji commented 6 years ago

As things currently stand, you would pack them all into a single command's CommandText and execute. Npgsql splits them and send via separate protocol messages, which potentially fit into a single outgoing TCP packet, if the updates are small enough. https://github.com/dotnet/corefx/issues/3688 tracks adding a better API for batching which would possibly obviate client-side parsing.

In addition, preparation should be used, it has a significant impact for PostgreSQL.

I admit I haven't looked much at the update tests as the focus was on fortunes, I'll try to take ask a look and see where we are on the rest.

roji commented 6 years ago

I'll take a look at Querys and the other db benchmarks this weekend to see what can be improved

benaadams commented 6 years ago

dotnet/corefx#3688 tracks adding a better API for batching which would possibly obviate client-side parsing.

So that would be allowed by the updates part of data updates; however would be disallowed by the queries part as the queries need to be sent by the app as non-batched (though driver is allowed to then batch them)

However you could; validly, do overlapping queries e.g.

var task1 =  cmd1.ExecuteReaderAsync();
var task2 =  cmd2.ExecuteReaderAsync();
var task3 =  cmd3.ExecuteReaderAsync();
var task4 =  cmd4.ExecuteReaderAsync();
var task = await Task.WhenAny(task1, task2, task3, task4);
using (var rdr = await task)
{
    // ....
}

benaadams commented 6 years ago

Npgsql splits them and send via separate protocol messages, which potentially fit into a single outgoing TCP packet, if the updates are small enough.

Does it split the string and create more strings? Or just serialize in chunks to the wire?

roji commented 6 years ago

So that would be allowed by the updates part of data updates; however would be disallowed by the queries part as the queries need to be sent by the app as non-batched (though driver is allowed to then batch them)

However you could; validly, do overlapping queries

If I understand you correctly, you're saying that this is disallowed:

var cmd = new NpgsqlCommand("SELECT 1; SELECT 2", conn);
var reader = cmd.ExecuteReader();

While the following is allowed:

var cmd1 = new NpgsqlCommand("SELECT 1", conn);
var cmd2 = new NpgsqlCommand("SELECT 2", conn);
var reader1 = cmd.ExecuteReader1();
var reader2 = cmd.ExecuteReader2();

If so, that seems a bit ridiculous - they're equivalent in almost any way, at least perf-wise (again, assuming a single producer, no multiplexing i.e. handling a single HTTP request)... The only user-facing difference would seem to be error handling (i.e. does the 2nd query get executed if the first fails). The first is already supported - Npgsql and PostgreSQL don't care if the batched statements are updates or selects - while the 2nd isn't. Can you point to the part in the TechEmpower docs which says the 1st is disallowed while the 2nd is OK?

Npgsql splits them and send via separate protocol messages, which potentially fit into a single outgoing TCP packet, if the updates are small enough.

Does it split the string and create more strings? Or just serialize in chunks to the wire?

At the moment it splits, creating more strings (not ideal). Unfortunately there's another unrelated aspect that requires this: PostgreSQL parameter placeholders are positional ($1, $2), while historically ADO.NET (and therefore Npgsql) have allowed named placeholders (@p1, @p2). So in addition to splitting on semi-colons, the client-side SQL parsing phase also rewrites placeholders. https://github.com/npgsql/npgsql/issues/1042 is precisely about avoiding all this - if we allow applications to use native PostgreSQL positional placeholders, and in addition solve the batching API issue (https://github.com/dotnet/corefx/issues/3688), then Npgsql can stop parsing and rewriting the SQL entirely. There's also a chance that Span<char> can help reduce allocations now without such a big change.

There's one more aspect interacting with all this - prepared statements. If you call Prepare() on a command, all of the above parsing (and server preparation) happens at parse-time, and subsequent calls to Execute() avoid all of this. Unfortunately many applications (including the TE scenarios) have short-lived connection lifecycles, so you must instantiate a new command on each HTTP request. To still allow such applications to benefit from prepared statements, Npgsql persists prepared statements, which means that the server-side prepared statements doesn't get freed when the connection is returned to the pool. The next time NpgsqlCommand.Prepare() is called, we look up the each statement in an internal dictionary, and if we find that this particular SQL has already been prepared on this particular physical connection, we skip the preparation phase. This is a pretty major perf optimization (prepared statements are critical in PostgreSQL); however, it does require Prepare() to parse the CommandText to split on semicolons, so that each SQL can be looked up in the dictionary...

https://github.com/npgsql/npgsql/issues/1701 was about caching the raw SQL of the entire command (CommandText) rather than individual statements' SQLs, precisely to avoid having to parse CommandText when calling Prepare() on a command SQL that is already prepared (i.e. just do a single dictionary lookup on CommandText). This turned out to be quite complicated (two levels of caching - one for the commands, one for the statements) and the perf benefits simply didn't seem to justify it. And again, all this is basically a workaround for the fact that ADO.NET has no sane batching API and requires concatenating statements into CommandText (https://github.com/dotnet/corefx/issues/3688). So I ended up dropping it...

The above is quite a lot of Npgsql internals to take in :) Let me know if you want any more details, and looking forward to hearing any ideas.

bgrainger commented 6 years ago

Can you point to the part in the TechEmpower docs which says the 1st is disallowed while the 2nd is OK?

https://www.techempower.com/benchmarks/#section=code

“Test type 3: Multiple database queries”, point 6:

It is not acceptable to execute multiple SELECTs within a single complex query. It is not acceptable to retrieve all required rows using a SELECT ... WHERE id IN (...) clause. However, note General Requirement 7 which permits pipelining of network traffic between the application and database.

“General Requirements”, point 7:

… However, it is permissible to pipeline traffic between the application and database as a network-level optimization. That is, two or more separate queries may be delivered from the application to the database in a single transmission over the network, and likewise for query results from the database back to the application, as long as the queries themselves are executed separately on the database server and not combined into a single complex query.

roji commented 6 years ago

@bgrainger hmm, thanks. However, I'm not sure that actually disallows SELECT 1; SELECT 2.

I think it can be claimed that SELECT 1; SELECT 2 isn't actually "multiple SELECTs within a single complex query": it's two different queries inside a single command (or batch); SELECT 1; SELECT 2 is nothing more than "piplining of network traffic between the application and database".

The restriction in point 6 seems to be more about using some SQL features to express the same thing in one query - see this comment. For example, it's forbidden to do select * from world where id in (2,3) instead of two selects. That restriction does make sense to me.

This seems supported by the following from the comment linked to above:

I don't know what Revenj does there. If I enabled that commented-out code and I enabled postgres's query log, would the query log show a separate "SELECT FROM world WHERE id = ?" statement for each query, or would it show a bulk query like "SELECT FROM world WHERE id IN ...?" If it's the latter, then no, that is still disallowed.

Sending a single NpgsqlCommand with SELECT 1; SELECT 2 will cause the PostgreSQL log to show two complete separate queries, which happened to be pipelined and combined into a single TCP packet.

So according to my interpretation this is allowed. I've posted a question to be absolutely sure.

bgrainger commented 6 years ago

I agree that the rules are a little ambiguous. If they wanted separate DB queries (and network roundtrips), it would be nice if there were a clear data dependency in the tests, i.e., the second query depended on the results of executing the first query and thus couldn't be pipelined.

bgrainger commented 6 years ago

FWIW, I just looked at the source code of some of the fastest MySQL entrants (in the Multiple Queries and Data Updates categories); AFAICT each of them is executing a separate SELECT * FROM World WHERE id = ? command, repeatedly in a loop. These entrants all appear to be synchronous so it seems unlikely to me that an ORM layer (I did not look at how that is implemented for each framework) is combining them into a single network packet behind-the-scenes. The biggest differences I could see from aspcore-ado-my are a) synchronous, and b) prepared statements.

benaadams commented 6 years ago

If I understand you correctly, you're saying that this is disallowed: ... While the following is allowed: ...

Nah, wasn't the point I was making. The second example you gave is more or less the same as doing it in a loop (as they are sync queries)

So the first is disallowed because it is directly batching in application code, and that's only allowed as a network optimisation by the driver:

7 . Except where noted, all database queries should be delivered to the database servers as-is and not coalesced or deduplicated at the database driver. In all cases where a database query is required, it is expected that the query will reach and execute on the database server. However, it is permissible to pipeline traffic between the application and database as a network-level optimization. That is, two or more separate queries may be delivered from the application to the database in a single transmission over the network, and likewise for query results from the database back to the application, as long as the queries themselves are executed separately on the database server and not combined into a single complex query.

What I was suggesting is using async; stashing aside the Task, so it could make requests concurrently (i.e. not awaiting for the results of the first query until a few more queries had been individually submitted)

roji commented 6 years ago

Even in async, if we compare this:

var cmd = new NpgsqlCommand("SELECT 1; SELECT 2", conn);
var reader = await cmd.ExecuteReaderAsync();

with this:

var cmd1 = new NpgsqlCommand("SELECT 1", conn);
var cmd2 = new NpgsqlCommand("SELECT 2", conn);
var task1 = cmd1.ExecuteReaderAsync();
var task2 = cmd2.ExecuteReaderAsync();
var task = await Task.WhenAny(task1, task2);

Once again, at the network level the above two fragments produce almost the exact same thing: a single roundtrip sending two queries, in almost identical wire protocol messages - there's no advantage in the 2nd fragment compared to the first one. This is also why I can't really see any sane reason for the 1st fragment to be disallowed in a benchmark, while the 2nd fragment would be allowed. So I don't really see the sync/async aspect as being very important here.

The only advantage of the 2nd fragment, is if you don't have all the queries you want to send upfront (e.g. because they're gradually coming in via some HTTP connection or something). The 2nd method would allow you to send each command as it's coming in, whereas the 1st only allows you to batch several upfront, but then wait until that batch is complete.

roji commented 6 years ago

One last comment, to reiterate on @bgrainger's earlier comment. If we look at this:

var task1 = cmd1.ExecuteReaderAsync();
var task2 = cmd2.ExecuteReaderAsync();
var task = await Task.WhenAny(task1, task2);

When does it make sense for task2 to complete? If it completes as soon as we're starting to receive the 2nd resultset (which is what the ADO.NET API suggests), then task2 can only complete after the 1st resultset has been consumed. In other words, the Task.WhenAny() above is totally useless, since we know task1 will always complete before task2.

There's nothing necessarily wrong with this, just pointing out that in the end there's a single TCP connection on which we're serially sending queries - there's no actual parallelism going on.

benaadams commented 6 years ago

For the sync version

var cmd1 = new NpgsqlCommand("SELECT 1", conn);
var cmd2 = new NpgsqlCommand("SELECT 2", conn);
var reader1 = cmd.ExecuteReader1();
var reader2 = cmd.ExecuteReader2();

The statement var reader2 = cmd.ExecuteReader2(); cannot be run until the the preceding one is complete.

Whereas for the async version it can.

Once again, at the network level the above two fragments produce almost the exact same thing: a single roundtrip sending two queries, in almost identical wire protocol messages - there's no advantage in the 2nd fragment compared to the first one.

Which is the point. Batching the queries as one command in the app as SELECT 1; SELECT 2 allows any framework to do it; whereas requiring them as separate statements tests a frameworks ability to do concurrency; which not all do, and not all do well.

Rephrasing; the 20 query test as an http type test. If instead you had to download 20 images from a remote site - it becomes a test as to whether you can do it in parallel and how well you can do that as well as how your download works.

Does it take 20 x the download time; or 1 x the longest download because they are done concurrently?

If the db driver then takes advantage of this an pipelines/batches them; that's great, but its an implementation detail outside of the code you are writing using the framework and the tests are about how code you can write with the framework performs (the db driver would count as part of framework rather than user code)

They are also looking to remove the database component to focus specifically on this concurrency:

Future: Tests that exercise requests made to external services and therefore must go idle until the external service provides a response. (High concurrency is desirable here.)

https://twitter.com/BrianHauerTSO/status/1000127582627545088

We are finding the variability in perfectly legitimate ways to communicate to databases (and variability of databases in general) make enforcing very strict uniformity of implementation unappealing. Perhaps better to do this with an intentionally rigid endpoint (e.g., jsonrpc).

benaadams commented 6 years ago

There's nothing necessarily wrong with this, just pointing out that in the end there's a single TCP connection on which we're serially sending queries - there's no actual parallelism going on.

Whether its parallel, concurrent, pipelined or serial sends (request->response->request) is an implementation detail of how its achieved.

The important factor is its not the user code doing it - but the benefit of the framework for idiomatic(ish) user code.

benaadams commented 6 years ago

Hmm... reading the clarification: https://groups.google.com/forum/#!msg/framework-benchmarks/A78HrYsz4AQ/_zneyvo2CwAJ

For the kind of "pipelining" we do allow, it was my intent that multiple queries do NOT get batched together in such a way that the batching itself affects query semantics at all, especially with respect to failed queries. For example, what would happen if you changed that Npgsql command to "SELECT 1; SELECT foo bar; SELECT 2", where "foo bar" is nonsense that will cause that middle SELECT to fail? Will the "SELECT 2" query still be executed? If not, then that sort of batching violates requirements as I see them. Why? Because that's exploiting simplifications in the test scenarios in a way that goes against the spirit of the tests, where the same optimization wouldn't necessarily be a good idea in the kind of real production applications we're trying to simulate. Meanwhile, as I understand the pipelining that vertx-postgres is doing, there's no reason not to enable that kind of pipelining.

ADO allows multiple active result sets, so you could do

var cmd = new NpgsqlCommand("SELECT 1; SELECT 2", conn);
var reader = await cmd.ExecuteReaderAsync();
reader.ReadAsync();
reader.NextResult(); // move to next query
reader.ReadAsync();

And that would be valid if the invalid query in the middle of SELECT 1; SELECT foo bar; SELECT 2 doesn't cause SELECT 2 not to run and return results.

Question, would it? How would that present via the api? (The invalid query)

roji commented 6 years ago

@benaadams with the current behavior of Npgsql, if a statement in the middle of a batch fails then the rest of the statements are skipped by PostgreSQL. This is achieved via a feature of the PostgreSQL protocol, whereby the moment a protocol message generates an error, PostgreSQL will skip all subsequent messages until it sees a Sync message. So Npgsql will send SELECT 1; SELECT 2 as Parse1/Bind1/Describe1/Execute1/Parse2/Bind2/Describe2/Execute2/Sync (this is for unprepared queries, and all messages are potentially packed into a single TCP packet). If any of these messages generates an error, PostgreSQL will skip all the way to the sync.

In .NET API terms, you would be able to read resultsets and call reader.NextResult() until the problematic query, at which point calling reaader.NextResult() will generate an exception. But protocol sync will be maintained and the connection remains usable.

What I don't understand is why this failure semantics has anything to do with TechEmpower benchmark requirements - what possible interest is it of TechEmpower if Npgsql's batching logic skips remaining messages after error or not? In other words, should a database driver utilization be crippled in the benchmark just because it provides batching semantics which skip after error? What exactly would be the point of that?

Once again, I do understand the point of disallowing a single complex SQL statement such as SELECT x WHERE id IN (x, y, z) in place of three simple ones (SELECT x WHERE id=x...). But I don't understand the point of forbidden batching if it happens to have a certain skip-after-error behavior.

Out of curiosity, @benaadams, when executing two commands asynchronously with pipelining, would you expect the driver to retain the current streaming behavior, forcing querys' resultsets to be consumed in the same order that they were sent? Or would you expect resultsets to be buffered in memory, allowing the 2nd resultset to be consumed before the 1st? This question is irrelevant for multiplexing where buffering is clearly mandatory.

Finally, regardless of TechEmpower rules... Just to make sure I'm clear, I don't really have objections to pipelining or even multiplexing - they do have their perf benefits (although pipelining by a single client seems to be helpful only for a scenario which I find relatively rare - requests gradually coming in and being gradually pipelined on the same physical db connection). It definitely is a considerable departure from the current driver model, and I'm still not 100% sure I understand the perf impacts (i.e. what is the perf penalty of switching from the current streaming behavior to buffering?). But it's definitely an interesting direction.

At some point we may want to explore a design where a database driver has a low-level send/receive API (somewhat similar to libpq), on top of which we would be able to build either a classical ADO.NET model or a multiplexing layer.

benaadams commented 6 years ago

What I don't understand is why this failure semantics has anything to do with TechEmpower benchmark requirements - what possible interest is it of TechEmpower if Npgsql's batching logic skips remaining messages after error or not?

Because the semantics of the test are a for a set of independent queries:

execute Select 1
read
execute Select 2
read

If a failure in one query causes the following queries not to execute then they aren't being run as independent queries; so its a semantically different behaviour - therefore not a valid optimisation to be equivalent i.e. its an observable change in behaviour (though likely the correct behaviour for an actual combined query?)

Out of curiosity, @benaadams, when executing two commands asynchronously with pipelining, would you expect the driver to retain the current streaming behavior, forcing querys' resultsets to be consumed in the same order that they were sent?

I was expecting it to fail horribly; as it would likely be stomping over state, which is why for the Mono version I opened multiple connections instead and then distributed the queries amongst them. However that pushed the connection count from 512 to 5120 and executed them in parallel - which just made everything unhappy - so I had to back it out.

roji commented 6 years ago

What I don't understand is why this failure semantics has anything to do with TechEmpower benchmark requirements - what possible interest is it of TechEmpower if Npgsql's batching logic skips remaining messages after error or not? Because the semantics of the test are a for a set of independent queries

That's exactly what I have a hard time accepting. So if Npgsql happened to implement its batching by sending Parse1/Bind1/Describe1/Execute1/Sync/Parse1/Bind1/Describe1/Execute1/Sync (note that additional Sync in the middle), then it would be eligible for using this optimizations in the benchmarks? How does that make sense, who benefits from this seemingly arbitrary exclusion? Would it make sense for Npgsql to have another batching mode without the skip-on-error logic so that batching can be used in TE? To be clear, I understand how what you're saying can be understood from the TE rules, but I'm questioning the sense of those rules.

The way I see it, forbidding the "complex query" case makes sense for preventing one benchmark implementation from things like using SQL IN for reducing multiple queries into one (especially since that may not be possible with some NoSQL databases). However, the fact that a batch skips-on-error is very different from that, and in my mind cannot be considered a single complex query in any way.

In any case, this discussion needs to happen with the TechEmpower folks rather than with you @benaadams (although it's interesting to hear your opinion). I'll post the arguments above in the Google discussion tomorrow.

roji commented 6 years ago

I think we can leave the TechEmpower eligibility questions to the thread where Io posted, but it's still interesting to discuss batching/pipelining/multiplexing here...

I'm specifically interested in what people think about the impact of switching to buffering of resultsets as a result of multiplexing and possibly some forms of pipelining (since resultsets aren't consumed in order). I tend to think that streaming is very important when compared to buffering, but I know EF Core does buffer in some common cases. What are people's intuitions (or experience) about the degradation caused by buffering?

bgrainger commented 6 years ago

Most of my (micro) ORM experience is with Dapper and they buffer by default.