dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.96k stars 4.65k forks source link

New System.Data.Common batching API #28633

Closed roji closed 3 years ago

roji commented 5 years ago

This proposal has been approved in principle, subject to confirmation by several database providers that the API shape is good.

Issues tracking provider implementations:


This issue is based on previous discussions in dotnet/runtime#15375 and dotnet/runtime#17445.

Background

Batching multiple SQL statements can be critical for database performance, as they can be executed in a single roundtrip rather than waiting for each statement to complete before sending the next one. System.Data.Common doesn't currently provide a first-class API for this, but some providers (e.g. SqlClient) allow batching statements by concatenating them together into DbCommand.CommandText, separated by semicolons:

var cmd = connection.CreateCommand();
cmd.CommandText = "INSERT INTO table (1); INSERT INTO table (2)";

The problem with this approach, is that most databases require separate protocol messages for each statement (e.g. PostgreSQL, MySQL), forcing the database's ADO.NET provider to parse the SQL client-side and to split on semicolons. This is both unreliable (parsing SQL is difficult) and bad for performance - ideally an ADO.NET provider should simply forward user-provided SQL to the database, without parsing or rewriting it (see dotnet/runtime#25022 for a related issue forcing providers to parse SQL).

As a specific case, SQL Server does natively support the multi-statement commands, but even there this has drawbacks for performance, detailed by @GSPP in dotnet/SqlClient#19. Also, as @bgribaudo pointed out, the current concatenation-based approach doesn't allow invoking multiple stored procedures in batching fashion. SqlClient does include SqlCommandSet, which performs efficient, non-concatenation-based batching, but that class is internal and currently usable only via DbDataSet, and not for general usage (NHibernate apparently accesses this via reflection). This proposal would allow exposing SqlCommandSet via a public API.

Goals

Proposed API

public abstract class DbBatch : IDisposable, IAsyncDisposable
{
    public DbBatchCommandCollection BatchCommands => DbBatchCommands;
    protected abstract DbBatchCommandCollection DbBatchCommands { get; }

    public DbBatchCommand CreateBatchCommand() => CreateDbBatchCommand();
    protected abstract DbBatchCommand CreateDbBatchCommand();

    #region Execution (mirrors DbCommand)

    // Delegates to ExecuteDbDataReader
    public DbDataReader ExecuteReader(CommandBehavior behavior = CommandBehavior.Default);
    protected abstract DbDataReader ExecuteDbDataReader(CommandBehavior behavior);

    // Delegate to ExecuteDbDataReaderAsync
    public Task<DbDataReader> ExecuteReaderAsync(CancellationToken cancellationToken = default);
    public Task<DbDataReader> ExecuteReaderAsync(CommandBehavior behavior, CancellationToken cancellationToken = default);

    protected abstract Task<DbDataReader> ExecuteDbDataReaderAsync(CommandBehavior behavior, CancellationToken cancellationToken);

    public abstract int ExecuteNonQuery();
    public abstract Task<int> ExecuteNonQueryAsync(CancellationToken cancellationToken = default);

    public abstract object ExecuteScalar();
    public abstract Task<object> ExecuteScalarAsync(CancellationToken cancellationToken = default);

    #endregion

    #region Execution properties (mirrors DbCommand)

    public abstract int Timeout { get; set; }

    // Delegates to DbConnection
    public DbConnection Connection { get; set; }
    protected abstract DbConnection DbConnection { get; set; }

    // Delegates to DbTransaction
    public DbTransaction Transaction { get; set; }
    protected abstract DbTransaction DbTransaction { get; set; }

    #endregion

    #region Other methods mirroring DbCommand

    public abstract void Prepare();
    public abstract Task PrepareAsync(CancellationToken cancellationToken = default);
    public abstract void Cancel();

    #endregion

    #region Standard dispose pattern

    public void Dispose() { ... }
    protected virtual void Dispose(bool disposing) {}

    #endregion
}

public class DbBatchCommandCollection : Collection<DbBatchCommand>
{
}

public abstract class DbBatchCommand
{
    public abstract string CommandText { get; set; }
    public abstract CommandType CommandType { get; set; }
    public abstract int RecordsAffected { get; }

    public DbParameterCollection Parameters => DbParameterCollection;
    protected abstract DbParameterCollection DbParameterCollection { get; }
}

public class DbConnection
{
    public DbBatch CreateBatch() => CreateDbBatch();
    protected virtual DbBatch CreateDbBatch() => throw new NotSupportedException();

    public DbBatchCommand CreateBatchCommand() => CreateDbBatchCommand();
    protected virtual DbBatchCommand CreateDbBatchCommand() => throw new NotSupportedException();

    // Covers both CreateBatch and CreateBatchCommand
    public virtual bool CanCreateBatch => false;
}

public class DbProviderFactory
{
    public virtual DbBatch CreateBatch() => throw new NotSupportedException();
    public virtual DbBatchCommand CreateBatchCommand() => throw new NotSupportedException();

    // Covers both CreateBatch and CreateBatchCommand
    public virtual bool CanCreateBatch => false;
}

public class DbException
{
    public DbBatchCommand BatchCommand => DbBatchCommand;
    protected virtual DbBatchCommand DbBatchCommand => null;
}

General usage and examples

Usage is fairly trivial and aligned with the existing DbCommand APIs. Users first create a new DbBatch, either by calling DbCommandFactory.CreateBatch(), or by instantiating one directly. Commands are added into the batch, execution properties (e.g. Timeout) are set on it, and finally on of the Execute*() methods are called on it. Connection, Transaction and Timeout are specified on the DbBatch, like they are set on DbCommand for un-batched operations.

Here is a code sample using DbProviderFactory for database portability:

using var batch = dbProviderFactory.CreateBatch();

var cmd1 = dbProviderFactory.CreateBatchCommand();
cmd1.CommandText = "UPDATE table SET f1=@p1 WHERE f2=@p2";
var p1 = dbProviderFactory.CreateParameter();
var p2 = dbProviderFactory.CreateParameter();
p1.Value = 8;
p2.Value = 9;
cmd1.Parameters.Add(p1);
cmd1.Parameters.Add(p2);
batch.Add(cmd1);

var cmd2 = dbProviderFactory.CreateBatchCommand();
cmd2.CommandText = "SELECT * FROM table WHERE f2=@p3";
var p3 = dbProviderFactory.CreateParameter();
p3.Value = 8;
cmd2.Parameters.Add(p3);
batch.Add(cmd2);

batch.Connection = conn;
batch.Transaction = transaction;

using var reader = batch.ExecuteReader();
// read contains one resultset, from SELECT

The verboseness of this API corresponds to how ADO.NET currently looks. General usability improvements are planned for later.

Here is a suggested code sample working against a specific provider and using initializers for better readability and terseness:

using var batch = new XyzBatch
{
    Connection = conn,
    Transaction = transaction,
    BatchCommands =
    {
        new XyzBatchCommand("UPDATE table SET f1=@p1 WHERE f2=@p2")
        {
            Parameters =
            {
                new XyzParameter("p1", 8),
                new XyzParameter("p2", 9),
            }
        },
        new XyzBatchCommand("SELECT * FROM table WHERE f2=@p1")
        {
            Parameters =
            {
                new XyzParameter("p1", 8),
            }
        }
    }
};

using var reader = batch.ExecuteReader();
// read contains one resultset, from SELECT

Design and naming motivations

The original proposal had the batch holding a list of DbCommand instead of the new DbBatchCommand type. The change, originally proposed by @bgribaudo, was motivated by the following reasons:

The name DbBatchCommand will hopefully convey the similarity between that type and DbCommand (both hold SQL, parameters and a CommandType), while at the same time keeping them distinct (DbBatchCommand isn't executable outside of a DbBatch). The name DbStatement was also considered, although it seems this would introduce more confusion:

Affected rows

DbCommand has APIs for finding out how many rows were affected by the command:

However, both of these provide an aggregate number; if the command contained more than one statement, it's currently impossible to find out how many rows were affected by each statement separately.

Providing non-aggregate affected rows could be done via an overload of ExecuteNonQuery[Async]() which returns an int[], or which accepts a user-provided int[] and populates it. The approaches have their complexities (perf, array management...), and as we're introducing a new DbBatchCommand dedicated to batching, we propose to simply add a RecordsAffected property on it instead. The provider would populate this property for each DbBatchCommand to indicate how many rows were affected by that command.

As with DbDataReader.RecordsAffected, this property would contain -1 for SELECT statements, and 0 if no rows were affected or the statement failed.

Command behavior

DbCommand.ExecuteReader() accepts a CommandBehavior enum which allows execution behavior to be tweaked. Even if it seems to be a rare case, users may need to specify different behaviors for different batch commands. As a result, DbBatch.ExecuteReader() does not accept a behavior parameter, and DbBatchCommand has a CommandBehavior property instead.

This requires the user to set any non-default behavior on each and every batch command. A batch-wide default could be accepted by DbBatch.ExecuteReader(), but that would require a way to distinguish between the batch default and the enum's default value (CommandBehavior.Default), e.g. by making it nullable on the DbCommandBatch. The introduced complexity doesn't seem to be worth it.

Note that not all providers will support all command behaviors on batched commands - it is expected that in some cases the entire batch will have to share the same behavior. Also, CommandBehavior.CloseConnection makes no sense on batched commands except the last, and the provider should probably throw on this.

DbException.Command

Since we now execute batches, we should provide a way of letting users know which command in the batch triggered an exception. We propose to do this by introducing a new virtual property on DbException, called BatchCommand, pointing to the relevant DbBatchCommand instance. In non-batched command execution this property would be left unassigned.

Notes on batch size management

In some cases, there may be some value in having the provider's batching implementation implicitly manage batch sizes. Examples:

The pro here is to free the user from knowing low-level, database-specific details, transparently increasing perf and reducing friction for everyone (we're assuming the user has no special knowledge which would assist in the decision-making). The con here is more complexity and a general uncertainty that this is needed - any examples from the community would be helpful.

Regardless, this can be viewed as an implementation detail of a provider; although we can provide guidelines, at the end of the day each provider will implement the behavior they desire.

Backwards-compatibility

As a new API, no breakage is being introduced, but it has been proposed to provide a backwards compatibility shim for the new API that either performs concatenation-based batching, or simply executes without batching at all.

As of now, the decision is to not provide any sort of shim: trying to use the new batching API with providers that haven't implemented support for it will throw a NotSupportedException. The reasons for this are:

Feature detection

As the API will throw for providers which don't implement it, a way must be provided for consumers to know whether the API is supported. The proposal is to add a CanCreateBatch bool property to both DbProviderFactory and DbConnection, alongside the CreateCommandSet() methods.

Additional Notes

Open questions

Edit history

Date Modification
2019-02-14 DbCommandSet now implements IEnumerable<DbCommand> and includes the two GetEnumerator() methods, as per @bgrainger's suggestion
2019-02-18 Updated proposal with different concrete options for backwards compatibility, batch size management, and added a usage example.
2019-02-22 Updated with the current decision on backwards compatibility (no shim), added feature detection discussion, added non-aggregated rows affected to goals.
2019-02-23 Added missing DbConnection.CreateCommandSet() and renamed DbProviderFactory.CreateCommandSet() (it previously had a Db which doesn't belong). Added CanCreateCommandSet property to DbProviderFactory, DbConnection.
2019-02-25 Added public non-virtual DbConnection.CreateCommandSet() along protected virtual CreateDbCommandSet()
2019-03-04 Added DbConnection, DbTransaction to DbCommandSet. Corrected ExecuteReaderAsync overloads, corrected some comments.
2019-03-06 DbCommandSet now includes a Commands list instead of implementing IEnumerable<DbCommand> itself.
2019-03-12 Renamed DbCommandSet to DbBatch, and changed it to hold a new DbBatchCommand type instead of DbCommand. Added readable/terse code sample, added explanatory notes and rearranged sections slightly.
2019-04-18 Added CommandBehavior to DbBatchCommand and removed the commandBehavior parameter to DbBatch.ExecuteReader(); added a section to explain. Added RecordsAffected to DbBatchCommand and updated the affected rows section.
2019-04-20 Final typo pass (thanks @bgrainger) and added sentence on CommandBehavior support.
2019-05-14 Note on values of RecordsAffected (for SELECT, for failed statements), add standard Dispose pattern methods to DbBatch, fixed issues raised by @bgrainger
2019-05-23 Apply standard ADO.NET pattern to DbBatch with virtual ExecuteDbDataReader and non-virtual ExecuteDataReader
2019-06-17 Removed DbBatch.CancelAsync() following removal of DbCommand.CancelAsync() in dotnet/runtime#28596, thanks @Wraith2. Replaced IList<DbBatchCommand> with DbBatchCommandCollection.
2021-06-21 Make DbException.DbBatchCommand protected
2021-06-29 Move CommandBehavior from DbBatchCommand to DbBatch.ExecuteReader
2021-07-01 Remove the setter for DbBatchCommand.RecordsAffected
2021-07-01 Added DbBatch.CreateBatchCommand and CreateDbBatchCommand
roji commented 5 years ago

/cc @divega @ajcvickers @bgrainger @ErikEJ @bricelam @AndriySvyryd

divega commented 5 years ago

Thanks @roji for putting this together. A couple of points:

bgrainger commented 5 years ago

Typo: IAsyncDisposal.

Will DbCommandSet implement the Dispose pattern (i.e., protected virtual Dispose(bool disposing), or public virtual Dispose() { /* no op */ }, or force all derived types to implement both Dispose and DisposeAsync (to satisfy the interfaces' requirements)?

Is it worth implementing IEnumerable a) to retrieve the commands in the batch, b) to enable C# collection initialization syntax?

Backwards-compatibility shim

Add:

  • Prepare[Async]() will simply call the same method on each of commands in the batch.

Or should it call the same method on the batch command it creates? (I don't understand the benefit of preparing the DbCommand objects in the DbCommandSet, which won't actually be executed themselves.)


MySQL implementation thoughts:

For the "text" protocol (the default), MySQL already supports all commands in a batch being sent in a single network packet. MySqlConnector would probably end up concatenating all the SQL from the DbCommandSet into one payload (maybe by performing multiple UTF-8 conversions into one output buffer, not by really concatenating actual C# strings). The DbCommandSet may not provide a practical enhancement for most MySQL users (who use the "text" protocol). Additionally, concatenating all the SQL means that it would be highly unlikely that DbException.Command could be set correctly (just as with the backwards-compatibility shim).

For the "binary" protocol (i.e., prepared commands), this would be a helpful API improvement. MySQL disallows more than one statement being prepared at once, so MySqlConnector has to parse SQL and split commands apart. The design goals of this proposal eliminate that requirement.

However, for MySQL, the second command in the batch wouldn't actually be sent until DbDataReader.NextResult[Async] is called. (I.e., DbCommandSet.ExecuteReaderAsync doesn't send all "binary protocol" commands in one network payload.) I don't think that would cause any problems with this proposal but executing “multiple SQL statements in a single roundtrip” would not be possible.

bgribaudo commented 5 years ago

@roji, thank you for writing this up. I am excited about the possibilities!

Currently, the proposal allows command behavior to be specified once for the entire set of commands. However, command behavior can be command-specific. The same behavior may not apply to all commands that could go into a command set. To accommodate this, what would you think of changing the proposal so that behavior may optionally be specified when commands are added (e.g. Add(someCommand, CommandBehavior.KeyInfo))?

roji commented 5 years ago

@divega maybe it makes sense to repurpose dotnet/SqlClient#19 to track the implementation of the new batching API in SqlClient specifically? That seems to be its main focus in any case. Regarding the additional overload accepting an int[] for records affected, I'm not sure what you mean anymore... The proposal above includes an overload which returns int[], but of course feel free to edit and add. I think this is one of the main open questions which still need to be resolved.

roji commented 5 years ago

@bgrainger

Typo: IAsyncDisposal.

Thanks, fixed.

Will DbCommandSet implement the Dispose pattern (i.e., protected virtual Dispose(bool disposing), or public virtual Dispose() { /* no op */ }, or force all derived types to implement both Dispose and DisposeAsync (to satisfy the interfaces' requirements)?

Good question. Most of the ADO.NET base classes implement System.ComponentModel.Component, which implements IDisposable with the dispose pattern (i.e. virtual empty implementation of Dispose(bool)). It definitely seems like DbCommandSet should also implement that pattern as well (possibly by extending Component). ~I'm guessing that the same pattern applies for IAsyncDisposable - I'll look into this soon and update the proposal. Let me know what you think.~

Is it worth implementing IEnumerable a) to retrieve the commands in the batch, b) to enable C# collection initialization syntax?

Richer access to the commands in the set is one thing I wanted to get feedback on. I'm not sure it makes sense for the command set to expose full list semantics - what's your view?

Unless I'm mistaken, C# collection initialization syntax actually has nothing to do with IEnumerable (which says nothing about addition/modification); all it requires is for the type in question to have an Add() method, which we already have in the proposal above.

The shim will set the CommandTimeout, Connection and Transaction properties on the batch command to the values of its corresponding properties.

Are you referring only to the shim, or to the DbCommandSet spec in general? For the latter, I'm not sure this makes much sense - what value does it provide over clearly specifying that they are ignored in batched execution? In addition, setting CommandTimeout could be construed as meaning that each command has its own timeout, which definitely doesn't work in the general batching case. So all in all, unless there's a persuasive reason, I'd rather we just ignored these.

  • Prepare[Async]() will simply call the same method on each of commands in the batch.

Or should it call the same method on the batch command it creates? (I don't understand the benefit of preparing the DbCommand objects in the DbCommandSet, which won't actually be executed themselves.)

You're right, this was a mistake. In theory it's possible to prepare the batch command as you suggest, and reuse it again with different parameter values. The issue is how to invalidate if any of the user-provided commands are modified (e.g. change in CommandText) - it doesn't seem like it will be possible to handle this by the shim. It's probably better to simply not do anything here.

Thanks for your comments on the MySQL provider. To be sure I'm understanding things correctly, here are a few questions:

roji commented 5 years ago

@bgribaudo,

Currently, the proposal allows command behavior to be specified once for the entire set of commands. However, command behavior can be command-specific. The same behavior may not apply to all commands that could go into a command set. To accommodate this, what would you think of changing the proposal so that behavior may optionally be specified when commands are added (e.g. Add(someCommand, CommandBehavior.KeyInfo))?

Most of the value on CommandBehavior seem like they make sense only at the batch level - CloseConnection, SingleResult, SingleRow, SequentialAccess (since a single reader is returned, and would not likely support switching between sequential and non-sequential between commands across providers). A good way to think about CommandBehavior is that actually affects the reader, rather than the command - it's a parameter to DbCommand.ExecuteReader(), and cannot be specified where there is no reader (ExecuteNoQuery(), ExecuteScalar()). In batched mode, we're still going to have a single reader (possibly with multiple resultsets), so multiple CommandBehaviors don't really seem to fit well.

bgrainger commented 5 years ago

Will DbCommandSet implement the Dispose pattern

In favour of implementing the standard Dispose pattern for consistency. I have no experience with IAsyncDisposable but would lean towards implementing the same pattern if one exists.

Unless I'm mistaken, C# collection initialization syntax actually has nothing to do with IEnumerable

I believe you are mistaken; see docs.

Perhaps unfortunately, this does immediately imply access to the commands in the collection, which feels like a distraction from the point of this class. It does already have Add and Clear, and Count might make sense, which is halfway to ICollection<T>. But Contains, CopyTo, and Remove feel completely unnecessary, make implementations more complex, and imply some kind of equality for DbCommand objects. (This would probably end up just being reference equality, but is that what users expect?)

It doesn't feel like the point of DbCommandSet is to be a generic container of commands with full IList API. There's no obviously best answer here (to me), so I think I would lean towards keeping your proposed API, not implementing IEnumerable, and giving up C# collection initialisation until it becomes clear that that would be a high-value addition for typical users of the API. (I don't yet have a good sense for who will be the typical user; if it ends up primarily being used in ORM implementations, then some C# syntax sugar is less important.)

The shim will set the CommandTimeout, Connection and Transaction properties

Are you referring only to the shim

Yes, I was suggesting that you specify the behaviour of the shim implementation w.r.t. those properties. I think the shim should do something with them when creating its batch command, and this seems like the right thing to do.

Are parameters supported in the text protocol?

No, they are not. MySqlConnector parses the CommandText, finds ? or @param tokens (with the minimum smarts necessary to not detect these in string literals, comments, etc.) then substitutes in the (escaped) parameter values.

Is the choice between the text and binary protocols something that the user controls somehow?

Yes, through a convoluted mechanism. Binary protocol is implied by calling MySqlCommand.Prepare. However, because prepared commands support only a limited subset of SQL, and this might be surprising to users, Connector/NET added a IgnorePrepare option, which defaults to True. MySqlConnector implements the same convention. To use the binary protocol, you have to set IgnorePrepare=False, then call .Prepare() on each command.

so the MySQL binary protocol does not and cannot allow for batching at all

Correct. Each individual statement is given its own (integral) statement ID, and an "execute statement" command specifies a single statement ID. The client prepares all the statements in advance, then sends "execute statement ID 1", then reads the result set, then sends "execute statement ID 2", then reads its result set, etc.

Some MySQL Servers will permit pipelining, i.e., the client sending one network packet that contains multiple "execute statement ID n" payloads, then the server reading those one-at-a-time and processing them in order. However, there are a wide variety of proxies and backends (Aurora, Azure, TiDB, Percona, etc.) that speak the MySQL protocol and pipelining tends to break them. (It could possibly be added as an opt-in option for people who needed maximum performance and were confident their DB supported it.)

Also, this page seems to indicate that it may at least be possible to concatenate multiple statements in a single COM_STMT_PREPARE.

I have not found this to be true in practice. I thought I had once found documentation confirming that it's not supported but don't remember where that was. But that's a good point that some future MySQL Server might permit it.

divega commented 5 years ago

Regarding the additional overload accepting an int[] for records affected, I'm not sure what you mean anymore... The proposal above includes an overload which returns int[], but of course feel free to edit and add. I think this is one of the main open questions which still need to be resolved.

@roji no problem. I added what I described on my emails as option 2, since it is a variation of option 1.

maybe it makes sense to repurpose dotnet/SqlClient#19 to track the implementation of the new batching API in SqlClient specifically? That seems to be its main focus in any case.

Ok, done.

roji commented 5 years ago

Unless I'm mistaken, C# collection initialization syntax actually has nothing to do with IEnumerable

I believe you are mistaken; see docs.

Sorry, my bad - you're right of course. That's what comes out of sloppy testing at 3AM.

Perhaps unfortunately, this does immediately imply access to the commands in the collection, which feels like a distraction from the point of this class. It does already have Add and Clear, and Count might make sense, which is halfway to ICollection<T>. But Contains, CopyTo, and Remove feel completely unnecessary, make implementations more complex, and imply some kind of equality for DbCommand objects. (This would probably end up just being reference equality, but is that what users expect?)

It doesn't feel like the point of DbCommandSet is to be a generic container of commands with full IList API. There's no obviously best answer here (to me), so I think I would lean towards keeping your proposed API, not implementing IEnumerable, and giving up C# collection initialisation until it becomes clear that that would be a high-value addition for typical users of the API. (I don't yet have a good sense for who will be the typical user; if it ends up primarily being used in ORM implementations, then some C# syntax sugar is less important.)

I agree with most of what you say, although I do think there's real value in providing collection initialization - maybe a good mid-way solution is to implement IEnumerable<DbCommand> but not IList<DbCommand>.

The shim will set the CommandTimeout, Connection and Transaction properties

Are you referring only to the shim

Yes, I was suggesting that you specify the behaviour of the shim implementation w.r.t. those properties. I think the shim should do something with them when creating its batch command, and this seems like the right thing to do.

OK, I understand, but I don't see why the shim should do anything with these properties - can you explain why you think that's important? The shim would obviously set those properties on the internally-used batch command it creates (since that's the one that actually executes), but why the others? Another possible issue is that it would be somewhat incoherent to have the shim do it, but not to require the same of actual provider implementations, and I don't really see the sense in that either...

Are parameters supported in the text protocol?

No, they are not. MySqlConnector parses the CommandText, finds ? or @param tokens (with the minimum smarts necessary to not detect these in string literals, comments, etc.) then substitutes in the (escaped) parameter values.

I see, so if I understand correctly you support formatting and parsing parameter values both in text (for unprepared) and in binary (for prepared)... Npgsql used to work the same way a long while ago. Note also dotnet/runtime#25022 which I plan to tackle soon, and which may also obviate the need to parse for parameter substitution/interpolation. The ultimate goal here is to allow you to write your driver with any sort of parsing/rewriting - hopefully that can work out for you. If there are any other reasons you parse/rewrite (aside from batching and parameter substitution), could you post a note about them?

Is the choice between the text and binary protocols something that the user controls somehow?

Yes, through a convoluted mechanism. Binary protocol is implied by calling MySqlCommand.Prepare. However, because prepared commands support only a limited subset of SQL, and this might be surprising to users, Connector/NET added a IgnorePrepare option, which defaults to True. MySqlConnector implements the same convention. To use the binary protocol, you have to set IgnorePrepare=False, then call .Prepare() on each command.

Thanks for this information - this kind of cross-provider/cross-database information is very important. In any case, it's unfortunate that MySQL imposes these restrictions...

so the MySQL binary protocol does not and cannot allow for batching at all

Correct. Each individual statement is given its own (integral) statement ID, and an "execute statement" command specifies a single statement ID. The client prepares all the statements in advance, then sends "execute statement ID 1", then reads the result set, then sends "execute statement ID 2", then reads its result set, etc.

Some MySQL Servers will permit pipelining, i.e., the client sending one network packet that contains multiple "execute statement ID n" payloads, then the server reading those one-at-a-time and processing them in order. However, there are a wide variety of proxies and backends (Aurora, Azure, TiDB, Percona, etc.) that speak the MySQL protocol and pipelining tends to break them. (It could possibly be added as an opt-in option for people who needed maximum performance and were confident their DB supported it.)

OK. IMHO batching is important enough for perf that it may make sense to provide an opt-in for this, although it's true that you also have the text protocol that already supports this. Regarding special backends and their capabilities, in the PostgreSQL case it's sometimes possible for the driver to detect this and automatically switch off/on certain features, maybe it's possible in your case as well (just thinking out loud).

Thanks again for all the detail and the valuable input into this - please don't hesitate to share any other thoughts on this proposal etc.

FransBouma commented 5 years ago

Looks great :)

For the shim: e.g. DB2 requires BEGIN/END around batched statements, so the shim itself needs to consult the ADO.NET provider whether it can simply concat the statements or needs to do additional things. Some databases don't support select statements batched together with update statements for instance, so your shim should anticipate on that, consulting the ADO.NET provider. Some databases don't support batching at all (access/oledb) so for these, the shim should simply execute them one by one.

When concatenating statements, you obviously run into the problem of parameters with the same name. You therefore have to make the parameter names unique. This can be problematic as it requires you to parse the statement (as far as I can see).

Most ADO.NET providers have no real limit on the # of parameters except... SQL Server! :) So you have to consult the ADO.NET provider there too in the shim how much statements you can batch in 1 go automatically based on the # of parameters. I don't think it's that important to offer a 'batch size' value for the shim.

This is for .NET Core x.y only? Or will this be available on .net full as well?

roji commented 5 years ago

@FransBouma

For the shim: e.g. DB2 requires BEGIN/END around batched statements, so the shim itself needs to consult the ADO.NET provider whether it can simply concat the statements or needs to do additional things. Some databases don't support select statements batched together with update statements for instance, so your shim should anticipate on that, consulting the ADO.NET provider. Some databases don't support batching at all (access/oledb) so for these, the shim should simply execute them one by one.

I wasn't planning on the shim actually implementing all the different behaviors, but it's an interesting thought... Since I don't think a way exists to ask providers whether/how they support batching, I assume you're expecting the shim to check which provider is being used and to hardcode the different behaviors, right?

When concatenating statements, you obviously run into the problem of parameters with the same name. You therefore have to make the parameter names unique. This can be problematic as it requires you to parse the statement (as far as I can see).

Yes, I noted this problem above in the proposal - my plan is currently to throw in this case. It definitely seems like a non-starter for the shim to parse SQL to uniquify parameter names (especially as different databases have different SQL dialects etc.). This is after all only a backwards-compat shim, and it seems reasonable for it not to work for some edge cases - of course provider implementations of DbCommandSet can do what they want.

Most ADO.NET providers have no real limit on the # of parameters except... SQL Server! :) So you have to consult the ADO.NET provider there too in the shim how much statements you can batch in 1 go automatically based on the # of parameters. I don't think it's that important to offer a 'batch size' value for the shim.

That's a good point. I'd expect this kind of thing to be managed as an implementation detail by SqlClient's future implementation of DbCommandSet (as mentioned above, SqlClient actually has an internal SqlCommandSet which actually isn't publicly accessible). I do agree that we shouldn't concern ourselves with this at the general API level. Of course it's also possible for the user to execute two batches instead of one in certain situation (not that passing this onto the user is a good idea).

This is for .NET Core x.y only? Or will this be available on .net full as well?

I don't have a definite answer for this yet, but as a general rule new APIs get added to .NET Core (or rather to .NET Standard) and not necessarily to .NET Framework.

FransBouma commented 5 years ago

@roji

Since I don't think a way exists to ask providers whether/how they support batching, I assume you're expecting the shim to check which provider is being used and to hardcode the different behaviors, right?

yes, in the case when it's created directly. It could do that when the first DbCommand is added. With the DbProviderFactory route I recon the ADO.NET provider could create a specific derived type of the DbCommandSet shim and simply execute things sequentially if batching isn't supported.

I likely overlooked it, but if you don't have this yet, I think if there are commands added to the DbCommandSet which are assigned to different connections it should throw.

bgribaudo commented 5 years ago

If I'm understanding this correctly, when a command set returns multiple result sets, the first result set is accessed by calling ExecuteReader() (or ExecuteReaderAsync()). Then, subsequent result sets are fetched by calling NextResult() on the last-returned DbDataReader instance (as returned by execute or the last call to NextResult()). All result sets returned by the set of commands are returned one after the other, with no distinction made as to which result set came from which command.

It's possible (at least in the SQL Server world) for a single command to return zero, one or multiple result sets--and the number returned can dynamically vary (e.g. a stored procedure might be programmed to return one result set under certain conditions and two under others). Making sense out of the returned result sets in a situation like this can get tricky/be impossible without knowing which set came from which command.

For example, let's say I execute a command set consisting of two commands, both of which call StoredProcA. This proc returns zero, one or two result sets depending on circumstances. If executing the command set returns four result sets, I know that the first two correspond with the first call to StoredProcA and the last two correspond with the second call. However, if, say, only two total result sets are returned, sorting out which came from where is a bit trickier--did they both come from the first proc invocation, the second invocation or did one come from each?

To accommodate situations like this, is something like the following an option?

public class DbCommandSet … {
  // executes each command in the set
  CommandSetResults Execute(); 
  Task<CommandSetResults> ExecuteAsync();
   … 
}

// corresponds with the results returned by a single command in the set
public class CommandSetResults {
   DbDataReader GetReader(); // returns data reader for first result set from command; subsequent result sets from the command are accessed by calling LastResults() on the last-returned data reader instance
   object GetScaler(); 
   CommandSetResults CommandSetResults(); // returns a CommandSetResults for the next command's set of result sets, similar to how DbDataReader.NextResults() is used to get the next result set from a single command
   int AffectedRows { get; }
}

With an API like this, the developer passes in a set of commands then fetches out a set of command results--the granularity passed in is the same as what is returned. It also eliminates the need to figure out how to return a combined records effected count (because that number is available on a per-command basis) and potentially simplifies the exception situation (an exception associated with a command's results can be thrown while that command's results are being worked with, eliminating the need to figure out how to identify how to tag the exception to indicate which command triggered it).

GSPP commented 5 years ago

if the command contained more than one statement, it's currently impossible to find out how many rows were affected by each statement separately

This is critical for ORMs.

In general, this new API should expose all the raw power that is there. It should not drop information or limit the possible actions. This API is for consumers that maximally care about performance and flexibility.


The int[] recordsAffectedBreakdown allocation seems insignificant compared to the cost of making a database call. Even SELECT NULL is like 10000x the cost of that allocation. It's also G0 garbage.


Regarding error handling: It must be possible for providers to keep executing in case of error and keep incurring errors. SQL Server will happily continue in case of many errors. All those errors must be reported the API caller somehow. SQL Server commands can also generate info messages. These must be provided as well.


Is the timeout considered global or per command? This was not clearly stated as far as I can tell. IMO, it should be a global timeout. Most of the time, you want a timeout for the entire business transactions. If you expect creating a customer order takes 100ms then you want to abort if the entire thing takes e.g. 10s. You don't care about the duration of individual queries at all.


Feel free to close my now superseded issue https://github.com/dotnet/corefx/issues/31070. I am very happy to see progress on this. This is going to a major performance improvement.

roji commented 5 years ago

@FransBouma

Since I don't think a way exists to ask providers whether/how they support batching, I assume you're expecting the shim to check which provider is being used and to hardcode the different behaviors, right?

yes, in the case when it's created directly. It could do that when the first DbCommand is added. With the DbProviderFactory route I recon the ADO.NET provider could create a specific derived type of the DbCommandSet shim and simply execute things sequentially if batching isn't supported.

The way I was seeing it, the shim is only there for providers which haven't (yet) implemented any support. Providers which do simply have their own implementation of DbCommandSet which does not derive from the shim - they can do whatever it is they want. Thus, DB2 would automatically add BEGIN/END, SqlClient would split the batch when there are too many parameters, etc.

This raises the question of how desirable it is for the shim to start implementing provider-specific logic, which really belongs in the provider implementations. Some basic functionality may make sense: knowing that some specific providers simply don't support batching, and executing the command sequentially could be useful for the transitional period. But ultimately the expectation is for providers to provide their own implementation, totally unrelated to the shim.

I likely overlooked it, but if you don't have this yet, I think if there are commands added to the DbCommandSet which are assigned to different connections it should throw.

The idea was simply to ignore the Connection property on the commands in the set, as well as the Transaction and Timeout properties - only those on the DbCommandSet are used.

GSPP commented 5 years ago

Maybe the provider should be able to tell if it does support batching and other parameters such as what the maximum efficient number of commands and parameters is (before splitting needs to happen). Callers such as EF could interrogate the provider and adapt their behavior.

We'd need to identify all the common but relevant properties in which providers differ.

roji commented 5 years ago

@GSPP

if the command contained more than one statement, it's currently impossible to find out how many rows were affected by each statement separately

This is critical for ORMs.

In general, this new API should expose all the raw power that is there. It should not drop information or limit the possible actions. This API is for consumers that maximally care about performance and flexibility.

We're in agreement, the question is really about how best to expose the non-aggregated rows affected.

The int[] recordsAffectedBreakdown allocation seems insignificant compared to the cost of making a database call. Even SELECT NULL is like 10000x the cost of that allocation. It's also G0 garbage.

That may be true in the general case, but don't forget that this is an API abstraction that's also used to contact local and even in-memory databases such as Sqlite. We've also seen that when fetching cached data from PostgreSQL over a low-latency 10Gbps network link, even minor-seeming allocations start to have an effect in the TechEmpower benchmarks. I agree that these cases may be somewhat artificial and not related to real-world usage, but I'm generally averse to introducing APIs that necessarily cause allocations if there's another way.

Is the timeout considered global or per command? This was not clearly stated as far as I can tell. IMO, it should be a global timeout. Most of the time, you want a timeout for the entire business transactions. If you expect creating a customer order takes 100ms then you want to abort if the entire thing takes e.g. 10s. You don't care about the duration of individual queries at all.

I agree - and per-command timeouts are probably not technically possible in at least some (or even most) of the providers. There's a note on all this in the proposal above - CommandTimeout is ignored on the individual commands (like Connection and Transaction), and DbCommandSet has its own Timeout property.

Regarding error handling: It must be possible for providers to keep executing in case of error and keep incurring errors. SQL Server will happily continue in case of many errors. All those errors must be reported the API caller somehow. SQL Server commands can also generate info messages. These must be provided as well.

This is of course very provider-specific; the general batching API doesn't define any mandatory behavior but definitely should not get in the way either (there's already a note in the proposal above). I personally think that it makes more sense to abort the batch once an error occurs, but if continuing is the desired behavior for a specific provider (e.g. SQL Server), then it could throw an aggregate exception containing all the individual exceptions at the end of batch execution. At the moment I don't see anything in the proposal that needs to be changed, let me know if you see a problem.

Feel free to close my now superseded issue dotnet/SqlClient#19. I am very happy to see progress on this.

As @divega wrote on that issue, it's still useful to keep it to track the implementation of the SqlClient DbCommandSet - presumably the existing internal SqlCommandSet would be changed to conform to the new API being shaped here. So I think we should keep it open for now.

This is going to a major performance improvement.

Great :) Thanks for your own analysis and description in dotnet/SqlClient#19, it helped push the realization that a true batching API really is necessary.

Maybe the provider should be able to tell if it does support batching and other parameters such as what the maximum efficient number of commands and parameters is (before splitting needs to happen). Callers such as EF could interrogate the provider and adapt their behavior.

This is similar to the point @FransBouma made above. The way I'm looking at it, the API consumer (e.g. EF Core) should not be concerned with such implementation details - they should simply construct a single big batch and execute it. If a specific provider's batching implementation considers that the batch should be split - either because it must be (e.g. too many parameters) or because it would run more efficiently that way - then it would do so internally in a transparent way.

Unless I'm mistaken, for SQL Server there's even a threshold below which batching isn't efficient at all (see https://github.com/aspnet/EntityFrameworkCore/issues/9270). The SqlClient implementation would simply not batch at all for these cases.

There may be some details which make it impossible or undesirable to totally abstract this away from the user - possibly around exception handling - but at the moment I can't see a reason not to.

FransBouma commented 5 years ago

@roji

Since I don't think a way exists to ask providers whether/how they support batching, I assume you're expecting the shim to check which provider is being used and to hardcode the different behaviors, right?

yes, in the case when it's created directly. It could do that when the first DbCommand is added. With the DbProviderFactory route I recon the ADO.NET provider could create a specific derived type of the DbCommandSet shim and simply execute things sequentially if batching isn't supported.

The way I was seeing it, the shim is only there for providers which haven't (yet) implemented any support.

Ok, though if the shim fails, it's of no use to the user, so the user should be aware of the limits up front. E.g. if DB2 doesn't yet have their own implementation (and thus falls back on the shim, like any other provider) it will fail. This might be OK, but the user might be confused: "I use the default one and it breaks". So documentation might be in order that using the shim likely doesn't work in some cases.

Be aware too that mixing select commands and update commands might not always work with all db's. So if you concatenate statements in the shim, the user might run into this limitation too. I think documentation is enough here, as it breaks at runtime and it's not determinable up front if it will.

This raises the question of how desirable it is for the shim to start implementing provider-specific logic, which really belongs in the provider implementations. Some basic functionality may make sense: knowing that some specific providers simply don't support batching, and executing the command sequentially could be useful for the transitional period. But ultimately the expectation is for providers to provide their own implementation, totally unrelated to the shim.

yeah this is a bit of a tough call: concatenating statements and then not uniquing the parameter names already makes it fail on most systems, so if you don't do that, the user has to do that, and they likely already take other precautions (like group inserts together in 1 shim instance, updates in another etc.).

Regarding the values returned by ExecuteNonQuery(Async) which uses the concatenated query command: There is a trick to use here which allows us to determine per update statement whether it affected 0 or more rows: append ;SELECT param=rowcountvar to the update query, where param is an output parameter and rowcountvar is the rowcount var of the db (e.g. @@ROWCOUNT) . If the parameter is 0, the update statement affected 0 rows and should be reported as such to the user.

So for each query concatenated you append a unique parameter and a unique snippet. A bit overkill (I decided not to go for this in my framework, as when 1 query fails, e.g. in the middle, the whole batch is rolled back but you likely have to rerun the whole batch anyway as that's way easier than to sift through which command is the wrong one and why, then schedule the batch again without the ones already succeeded).

roji commented 5 years ago

@bgribaudo

If I'm understanding this correctly, when a command set returns multiple result sets, the first result set is accessed by calling ExecuteReader() (or ExecuteReaderAsync()). Then, subsequent result sets are fetched by calling NextResult() on the last-returned DbDataReader instance (as returned by execute or the last call to NextResult()). All result sets returned by the set of commands are returned one after the other, with no distinction made as to which result set came from which command.

That's correct. Note that this is how the ADO.NET API already works when you use concatenation-based batching.

It's possible (at least in the SQL Server world) for a single command to return zero, one or multiple result sets--and the number returned can dynamically vary (e.g. a stored procedure might be programmed to return one result set under certain conditions and two under others). Making sense out of the returned result sets in a situation like this can get tricky/be impossible without knowing which set came from which command. [...]

That's a very valid point - the scenario of a single command returning a variable number of resultsets is indeed tricky, even if it seems like quite an edge case. Your proposal for a CommandSetResults (or some version thereof) could help, but IMHO this approach has the following drawbacks:

In addition, even without this something like this, there are some workarounds which can allow users to distinguish which resultset belongs to which command: one could look at the shape of the resultset (which columns it has) to determine which stored procedure generated it, or one could even insert an arbitrary SELECT 'delimiter' in the batch, which consuming code would look for as a marker that the previous command completed. This could be considered a bit hacky, but it works well and doesn't seem to have any downsides.

To summarize, I think the concern you raise is valid. However, in my opinion it's too much of a rare case to consider (significantly) complicating the API surface for the general, simple case - especiall when valid workarounds seem to exist.

roji commented 5 years ago

@FransBouma

The way I was seeing it, the shim is only there for providers which haven't (yet) implemented any support.

Ok, though if the shim fails, it's of no use to the user, so the user should be aware of the limits up front. E.g. if DB2 doesn't yet have their own implementation (and thus falls back on the shim, like any other provider) it will fail. This might be OK, but the user might be confused: "I use the default one and it breaks". So documentation might be in order that using the shim likely doesn't work in some cases.

Agreed. Another approach would be for the shim to contain a hard-coded list of the specific providers which support cancatenation-based batching, and only do it for them; for other providers it would fall back to non-batched execution. This would ensure that the shim always works, although it doesn't always actually batch.

A more extreme approach would be to give up entirely on actually batching in the shim, and always execute in non-batched fashion (this could make sense if we think that the duplicate parameter name problem is going to occur a lot - not sure about this).

Be aware too that mixing select commands and update commands might not always work with all db's. So if you concatenate statements in the shim, the user might run into this limitation too. I think documentation is enough here, as it breaks at runtime and it's not determinable up front if it will.

Absolutely - this is a database-specific detail that we shouldn't make any attempt to handle. Documentation will be important here.

yeah this is a bit of a tough call: concatenating statements and then not uniquing the parameter names already makes it fail on most systems, so if you don't do that, the user has to do that, and they likely already take other precautions (like group inserts together in 1 shim instance, updates in another etc.).

That's true - I think we should do our reasonable best, but rewriting user-provided SQL of unknown dialects seems like it's a non-starter. Here as well, documentation will hopefully help people out, and providers will hopefully implement their own DbCommandSet to obviate the shim entirely.

Regarding the values returned by ExecuteNonQuery(Async) which uses the concatenated query command: There is a trick to use here which allows us to determine per update statement whether it affected 0 or more rows: append ;SELECT param=rowcountvar to the update query, where param is an output parameter and rowcountvar is the rowcount var of the db (e.g. @@ROWCOUNT) . If the parameter is 0, the update statement affected 0 rows and should be reported as such to the user. [...]

Yes, this is similar to what EF Core does for SQL Server in order to implement optimistic concurrency, which requires knowing for each update if it worked or not. This is exactly the kind of details that I'd expect the SqlDbCommandBatch to take care of internally, abstracting away the task from O/RMs and users. By the way, PostgreSQL provides this information as part of the wire protocol, so no additional hacky queries are needed.

GSPP commented 5 years ago

Thanks @roji. It seems I overlooked a few statements in this long thread.

should not be concerned with such implementation details

Unless I'm mistaken, for SQL Server there's even a threshold below which batching isn't efficient at all (see aspnet/EntityFrameworkCore#9270). The SqlClient implementation would simply not batch at all for these cases.

These details do not matter for semantics but for performance. The caller can adjust its behavior in ways that the provider cannot. For example, the provider does not understand the nature of the queries and the network latency environment.

EF might have 1 billion inserts queued but it will likely not want to create 1 billion command objects. So some moderation/negotiation on the number of commands per set has to take place in any case. If EF could query the maximum command and parameter count it could directly materialize the perfect number of commands.

I believe quite strongly, that the provider should never try to optimize the batch size or suppress batching in some cases. It's the wrong place to make these decisions.

The EF issue does not state what mechanism of batching was used. Maybe just concatenated SQL which has different properties than command sets (most significantly, recompilation). If command sets ever are slower I'd like to know why. What specific property (of the protocol?) caused that? It would be surprising to me.


A concatenation based shim would not work for statements that must be the single statement in the batch e.g. CREATE VIEW. Also, renaming parameters is observable. T-SQL can query the catalog views to obtain the currently executing SQL text. It might have implications for parameterization, query store and other tooling. Of course, SQL Server will get a true implementation but these examples show how difficult it is to shim this.

Maybe the shim should be the safest option possible and all optimizations should be left to the provider.

bgrainger commented 5 years ago

OK, I understand, but I don't see why the shim should do anything with these properties - can you explain why you think that's important? The shim would obviously set those properties on the internally-used batch command it creates (since that's the one that actually executes), but why the others?

I think we're saying the exact same thing here; I may have just worded it confusingly.

I am not proposing that DbCommandSet modify properties on any DbCommand that is added to the set.

All I'm suggesting is adding to the spec a bullet point that explains how the shim will set the properties on the internally-created batch command. Your spec mentions how CommandText is used; I thought it would be good to be thorough and explicitly document how all properties on DbCommandSet get applied to the internally-created batch command. (The bullet point I suggested adding deliberately paralleled your existing second bullet point.)

roji commented 5 years ago

@bgrainger ahh, understood! I indeed thought you wanted the shim to modify the properties on the user-provided commands. Added a bullet point as suggested.

roji commented 5 years ago

@GSPP

These details do not matter for semantics but for performance. The caller can adjust its behavior in ways that the provider cannot. For example, the provider does not understand the nature of the queries and the network latency environment. [...] I believe quite strongly, that the provider should never try to optimize the batch size or suppress batching in some cases. It's the wrong place to make these decisions.

I'd like to understand this point a bit better - it also seems to contradict somewhat with your comment above that proposes the provider expose the maximum efficient number of commands/parameters. After all, if the provider can expose this knowledge, why shouldn't it just act on it and remove the burden from the user?

But more importantly, I'm interested in exactly what kind of user knowledge could be useful for determining batch sizes/thresholds, that the provider does not have. Shouldn't batching always be desirable regardless of the network latency environment? How would the nature of the query impact the decision-making here? Some real-world examples could be useful on this.

Note that the user would always have the option of creating multiple DbCommandSets, according to whatever criteria they see fit. The question here is whether the provider implementation should make its own decisions on how to execute the commands given by the user. Don't forget we have some very clear provider-specific cases, like the max parameter limit per batch in the SQL Server case - do you really think this kind of thing should be the user's responsibility rather than the provider's?

The EF issue does not state what mechanism of batching was used. Maybe just concatenated SQL which has different properties than command sets (most significantly, recompilation). If command sets ever are slower I'd like to know why. What specific property (of the protocol?) caused that? It would be surprising to me.

That's very possible - I was also very surprised that batching two statements could actually be slower than not batching them, and was explained that it's related to overhead involved in batching. @divega and/or @AndriySvyryd may be able to provide more info.

EF might have 1 billion inserts queued but it will likely not want to create 1 billion command objects. So some moderation/negotiation on the number of commands per set has to take place in any case. If EF could query the maximum command and parameter count it could directly materialize the perfect number of commands.

Regardless of the above, if there really are many commands to execute, EF may decide to execute them in several batches - regardless of anything provider- or environment-specific, i.e. as a means of avoiding instantiation of lots of command instances. I'm not sure this is actually necessary, but the point is that it's a different kind of batch-breaking-down as the one discussed above.

roji commented 5 years ago

@GSPP one more comment - note that whether a particular implementation of DbCommandBatch splits the batch or not is entirely internal and provider-specific. We can of course publish guidelines and recommendations, but at the end of the day each provider will do whatever it is they think is necessary - in other words, it's not part of the API itself.

roji commented 5 years ago

@GSPP missed this last comment of yours:

A concatenation based shim would not work for statements that must be the single statement in the batch e.g. CREATE VIEW

Would it work in a non-concatenation-based batch? If not, then that seems OK - it's simply a limitation of SQL Server for both forms of batching. If CREATE VIEW can be batched, then it's an odd limitation :) But even in this case, it seems OK for the shim not to always work - it's a transitional mechanism until SqlClient gets its own optimized implementation, and it's a new API so nothing is being broken.

Also, renaming parameters is observable. T-SQL can query the catalog views to obtain the currently executing SQL text. It might have implications for parameterization, query store and other tooling. Of course, SQL Server will get a true implementation but these examples show how difficult it is to shim this.

I assume that by renaming parameters you mean changing their names so that they can be unique across all commands in the batch. If so, then I agree that this isn't viable, if only for practical reasons - we're not going to be able build an SQL parser into the shim that can detect placeholders which aren't part of SQL literals etc.

Maybe the shim should be the safest option possible and all optimizations should be left to the provider.

That seems to point towards the shim executing the commands without any sort of batching, as proposed above in the discussion with @FransBouma. I agree there's some merit in this...

Antaris commented 5 years ago

Is there going to be an considerations for T-SQLs GO batch separator? One thing that has always been a pain when building DB migration bits is when I want to execute scripts generated by SSMS that typically generate a number of batch scripts separated by GO.

Also, how will the different providers hand scope? For instance if I do the following:

DECLARE @val INT;
SELECT @val = 1;

If this were two commands, what happens to the scope of @val or is this actually a non-issue here?

Could an API be provided to accept a SQL statement and return a command batch? Is there any merit in such an API?

bgribaudo commented 5 years ago

[@GSPP] Regarding error handling: It must be possible for providers to keep executing in case of error and keep incurring errors. SQL Server will happily continue in case of many errors. All those errors must be reported the API caller somehow. SQL Server commands can also generate info messages. These must be provided as well.

[@roji] This is of course very provider-specific; the general batching API doesn't define any mandatory behavior but definitely should not get in the way either (there's already a note in the proposal above). I personally think that it makes more sense to abort the batch once an error occurs, but if continuing is the desired behavior for a specific provider (e.g. SQL Server), then it could throw an aggregate exception containing all the individual exceptions at the end of batch execution. At the moment I don't see anything in the proposal that needs to be changed, let me know if you see a problem.

I concur with the idea of allowing execution to continue on error when the server supports this. However, I think it's also important to have the ability to know about errors as they are encountered vs. after the entire batch has been executed.

Currently, with ADO.Net (at least, the SQL Server implementation), error exceptions are raised when the error is read from the incoming data stream. The consumer can then respond/react to the exception, as appropriate.

With this API as it currently stands, if a consumer is reading a result set that was not fully provided by the server due to an error, it sounds like the result set will look like a valid, complete result set. The consumer won't be able rely on that actually being the case until the entire set of commands has finishes executing. If the consumer wants to perform an irrevocable action on a per-data set basis, the consumer will need to cache all data sets as they are received while they wait for execution to complete and the possible aggregate exception before taking action.

If, instead, errors are raised as they are encountered, the consumer knows that when they read a complete data set, it is actually complete, and so can be actioned on immediately. If a related error is encountered during reading, the consumer is informed and gets to decide whether or not to continue on (e.g. move to the next data reader).

Another benefit of immediate exception on error is that, if the consumer decides that an error warrants cancelling the command set's execution, they can close the connection (perhaps automatically as the exception exits a using statement). The server can then detect this and cancel execution instead of continue on only for the consumer to ignore/discard the remainder of what's returned.

FransBouma commented 5 years ago

Currently, with ADO.Net (at least, the SQL Server implementation), error exceptions are raised when the error is read from the incoming data stream. The consumer can then respond/react to the exception, as appropriate.

I don't know, if the error isn't a transient error, the current transaction is likely to be aborted on the server side, meaning whatever you decide to do, you have to run the previous commands again as they've been rolled back due to the aborted transaction.

roji commented 5 years ago

@bgribaudo I'm not too sure about the handling of multiple errors in general and especially as they come - but some more information on exactly how SQL Server behaves is probably needed here (@divega do you have this knowledge?).

First, I agree with @FransBouma that in general, an error is likely to trigger a batch failure and/or transaction rollback - although this kind of thing is definitely database-dependent.

With this API as it currently stands, if a consumer is reading a result set that was not fully provided by the server due to an error, it sounds like the result set will look like a valid, complete result set.

You're talking about a partial resultset that gets terminated by an error, without a way for the user to know this? If the error is transient (e.g. network partition, server out of memory) then it's very unlikely that the batch can continue in any case (since connectivity has been lost). If you're thinking about a non-transient error, then I have no idea how that could happen in the middle of a resultset - do you have any concrete, real-world example?

roji commented 5 years ago

@Antaris

Is there going to be an considerations for T-SQLs GO batch separator?

I'm not an expert on SQL Server, but unless I'm mistaken the GO bit is required in SQL scripts to signal the end of a batch, and isn't part of the protocol in any way (as per the docs). In other words, it operates at a higher level than what we're discussing here, which is the low-level interface with the ADO.NET provider. A higher-level tool reading an SQL script may look for GO in order to execute the statements in a batch using this API, but it doesn't seem like this API is itself supposed to be aware of GO.

Also, how will the different providers hand scope?

That is a very SQL Server-specific question, and doesn't seem like it should be the concern of this API abstraction. If a variable's scope is restricted to its batch (as seems to be indicated in the GO docs), then it's the user's responsibility to make sure all statements referring to that variable are in the same batch as its declaration.

Could an API be provided to accept a SQL statement and return a command batch? Is there any merit in such an API?

The existing proposal already allows you to construct a DbCommandSet contain only a single DbCommand. However, I'm not sure what would be the point of that.

bgribaudo commented 5 years ago

@roji:

I personally think that it makes more sense to abort the batch once an error occurs, but if continuing is the desired behavior for a specific provider (e.g. SQL Server), then it could throw an aggregate exception containing all the individual exceptions at the end of batch execution.

@roji:

You're talking about a partial resultset that gets terminated by an error, without a way for the user to know this? If the error is transient (e.g. network partition, server out of memory) then it's very unlikely that the batch can continue in any case (since connectivity has been lost). If you're thinking about a non-transient error, then I have no idea how that could happen in the middle of a resultset - do you have any concrete, real-world example?

Hope you're having a great Friday!

My comment has to do with the "but if continuing" idea from the first quote above where errors are saved until the end of batch execution then thrown in an aggregate exception.

Say a single batch is executed against a SQL Server which returns two result sets. An error is encountered with the first result set resulting in the server being unable to fully return it. The second result set is returned in its entirety without any errors. In order to hold throwing the exception for the first result set's error until after execution of the entire batch has been completed (e.g. after the second result set has been read in its entirety), wouldn't the first result set have to be readable by the consumer without throwing exceptions?

In this scenario, the consumer would read as many rows out of the first result set as were returned before the error then be told by the data reader that the result set is complete (e.g. reader.Read() would return false), then move to the next result set (reader.NextResult()), read it in its entirety, and then and only then receive the aggregate exception letting them know that the first result set really wasn't complete.

I think this would be dangerous; that it would be better to throw an exception for a fatal error when it is encountered. If the consumer wants to continue working with additional result sets from the batch, they can catch the exception and move to the next reader. This is how SqlCommand currently works.

Example of SqlCommand's behavior:

using (var command = connection.CreateCommand())
{
    command.CommandText = "SELECT 1/0; SELECT 2";

    using (var reader = command.ExecuteReader())
    {
        try
        {
            reader.Read(); // Throws an exception because of the error in the first result set.
        }
        catch (SqlException e) when (e.Message == "Divide by zero error encountered.") {
            // However, if we catch the exception, we can move to reading the next result set.
        }

        // The fact that the first result set died on an error didn't cause SQL Server to abort the entire batch.
        // The second result set is still returned.

        reader.NextResult();

        reader.Read();
        reader.GetInt32(0); // returns 2
    }
}
roji commented 5 years ago

@bgribaudo, if you're happy with the DbCommand API as in the example above, then that's exactly what DbCommandSet will have according to the current proposal: when you invoke DbCommandSet.ExecuteReader(), you get a reader that can behave in the exact same way as above - depending on how a particular provider implements it. The API doesn't force a particular error handling behavior on implementors. In fact, the idea is for the DbCommandSet API to be as close as possible to the DbCommand API.

Does that make sense?

YohDeadfall commented 5 years ago

Add an overload of ExecuteNonQuery[Async]() which returns an int[]. Calling it would incur a heap allocation (although pay-per-play), and the naming for this isn't going to be pretty. In addition, add a similar int[] property on DbDataReader;

I don't think that having this property is a good idea. You already can get affected rows by each statement by iterating over results.

Returning an int as a result of ExecuteNonQuery doesn't provide meaningful information. For example, you have two DML statements. The first one should modify 2 rows, and the second - 3 rows. But because of an error you could get 3 and 2. The sum still is 5.

roji commented 5 years ago

I don't think that having this property is a good idea. You already can get affected rows by each statement by iterating over results.

Affected rows is mainly for DELETE, UPDATE and INSERT, where there are no results to iterate over... Think of a batch containing several updates - we want to provide the user with a way of knowing which update affected how many rows.

Returning an int as a result of ExecuteNonQuery doesn't provide meaningful information. For example, you have two DML statements. The first one should modify 2 rows, and the second - 3 rows. But because of an error you could get 3 and 2. The sum still is 5.

That's precisely the problem we're trying to solve. There's an int-returning version of ExecuteNonQuery() because users rarely care about the command-by-command breakdown (and the array allocation is involved), and also because we want to keep the API similar to DbCommand. This is why we're discussing the possibility of having two ways to get the affected records count.

YohDeadfall commented 5 years ago

Affected rows is mainly for DELETE, UPDATE and INSERT, where there are no results to iterate over... Think of a batch containing several updates - we want to provide the user with a way of knowing which update affected how many rows.

var recordsAffected = new array[10];
set.ExecuteNonQuery(recordsAffected);

// or

var recordsAffected = new array[10];
var currentIndex = 0;
using (var reader = set.ExecuteReader())
{
    do
    {
        recordsAffected[currentIndex] = reader.RecordsAffected;
    }
    while (reader.NextResult());
}

The first way perfectly fits the requirement. You know how many rows affected by each command in a batch. The second approach shouldn't be used when you're doing some data manipulations, but it's the right way to get the count of affected records in cases when you mix data selection and manipulation.

roji commented 5 years ago

@YohDeadfall, your first proposal is essentially what @divega proposed (option 2 under "Affected Rows").

The 2nd option of repeatedly checking RecordsAffected doesn't work because UPDATE, DELETE and INSERT don't have resultsets. For example, if you currently send a batch containing "SELECT ...; UPDATE...; SELECT ..." you get only two resultsets back. Also, in the docs for DbDataReader.RecordsAffected it's specified that "The RecordsAffected property is not set until all rows are read and you close the DataReader".

YohDeadfall commented 5 years ago

Also, in the docs for DbDataReader.RecordsAffected it's specified that "The RecordsAffected property is not set until all rows are read and you close the DataReader".

Could this behavior be changed for command sets only at least? I mean the last part of the sentence.

roji commented 5 years ago

Could this behavior be changed for command sets only at least? I mean the last part of the sentence.

It could, but there's still the other more important issue of INSERT, UPDATE and DELETE not returning a resultset.

One more thing with regards to your 1st proposal (overload of ExecuteNonQuery() which accepts an array)... See the note in https://github.com/dotnet/corefx/issues/35135#issuecomment-461422211 about stored procedures returning an arbitrary number of resultsets - this makes it harder for the user to know in advance exactly how many resultsets there's going to be.

YohDeadfall commented 5 years ago

It could, but there's still the other more important issue of INSERT, UPDATE and DELETE not returning a resultset.

Don't know how other providers work, but Npgsql (source) and SqlClient (source) use data readers to get the number of modified rows.

See the note in dotnet/corefx#35135 (comment) about stored procedures returning an arbitrary number of resultsets

It would be strange to get the count of affected rows for each result set from the stored procedure if the user doesn't know what's it doing. Is it an real scenario?

roji commented 5 years ago

It could, but there's still the other more important issue of INSERT, UPDATE and DELETE not returning a resultset.

Don't know how other providers work, but Npgsql (source) and SqlClient (source) use data readers to get the number of modified rows.

Yes, readers aggregate the rows affected from all executed commands. But the point is that there is no way to make them "stop" on each delete/update/insert and get the individual rows affected by it, because they have no resultset. It's not that they have empty resultsets - they have no resultsets at all, so when would you access DbDataReader.RecordsAffected to get the value for a specific delete/update/insert statement?

See the note in dotnet/corefx#35135 (comment) about stored procedures returning an arbitrary number of resultsets

It would be strange to get the count of affected rows for each result set from the stored procedure if the user doesn't know what's it doing. Is it an real scenario?

@bgribaudo hasn't provided a real scenario, but it doesn't seem completely implausible - you provide some parameter to the function which determines in some way how many resultsets are returned. In any case, I don't think this is decisive.

YohDeadfall commented 5 years ago

But the point is that there is no way to make them "stop" on each delete/update/insert and get the individual rows affected by it, because they have no resultset.

The next statement must be executed when NextResult is called, and the rest of them on closing.

roji commented 5 years ago

But the point is that there is no way to make them "stop" on each delete/update/insert and get the individual rows affected by it, because they have no resultset.

The next statement must be executed when NextResult is called, and the rest of them on closing.

I'm sorry but I'm not following... Can you please explain your proposal for fetching the non-aggregated affected records for a batch of updates (UPDATE ...; UPDATE ...; UPDATE ...)? With the current reader API, this kind of batch produces no resultsets whatsoever, so there's no way to get the records affected by each individual commands.

bgribaudo commented 5 years ago

@roj:

@bgribaudo, if you're happy with the DbCommand API as in the example above, then that's exactly what DbCommandSet will have according to the current proposal: when you invoke DbCommandSet.ExecuteReader(), you get a reader that can behave in the exact same way as above - depending on how a particular provider implements it. The API doesn't force a particular error handling behavior on implementors. In fact, the idea is for the DbCommandSet API to be as close as possible to the DbCommand API. Does that make sense?

Yes! Thanks!

bgribaudo commented 5 years ago

See the note in dotnet/corefx#35135 (comment) about stored procedures returning an arbitrary number of resultsets

It would be strange to get the count of affected rows for each result set from the stored procedure if the user doesn't know what's it doing. Is it an real scenario?

@bgribaudo hasn't provided a real scenario, but it doesn't seem completely implausible - you provide some parameter to the function which determines in some way how many resultsets are returned. In any case, I don't think this is decisive.

For a real-world example, how about SQL Server's sp_help? "The result sets that are returned depend on whether name is specified, when it is specified, and which database object it is." For example, if that sp is passed a table name as its parameter, the number of result sets returned can vary--e.g. based on whether the table has indexes and/or constraints defined.

The number of result sets returned can vary even when each invoked procured/batch has been build to return a fixed number of result sets. Say, several commands are batched together with no transaction in use. The first is an invocation of a particular stored proc that was built to return exactly two results sets. However, while assembling the first result set, an error is encountered, causing the procedure's execution to terminate. Execution then moves on to the next batch in the command set, which successfully returns one result set (as expected). Execution of the entire command set returned two result sets (one dying with an error, one without errors). The assumption that the first two result sets returned correspond with the first stored procedure doesn't hold here, even though that procedure is expected to return a fixed number of result sets.

bgribaudo commented 5 years ago

@YohDeadfall

Also, in the docs for DbDataReader.RecordsAffected it's specified that "The RecordsAffected property is not set until all rows are read and you close the DataReader".

Could this behavior be changed for command sets only at least? I mean the last part of the sentence.

Are you looking for a way to, say, get the affected row count from the first command in a batch then use that value to decide whether to read the results from the second command in a batch vs. having to wait until you've read all result sets returned from all commands before being able to access the affected count (or the array of affected counts)?

divega commented 5 years ago

@roji, @GSPP & others, here is way back machine link to a short article that explained how batching worked in SqlClient with SqlDataAdapter in ADO.NET 2.0:

https://web.archive.org/web/20080808113743/https://blogs.msdn.com/dataaccess/archive/2005/05/19/420065.aspx

Copying the important bits here:

The other way of doing batching is “simpler”, but happens at a lower level. I’m going to describe the SQL Server implementation here, but other database servers might have a similar feature. At the client-server protocol level (TDS), each command is executed against the server using a “request”. For parametrized and batched statements, this request is an “RPC request” (nothing to do with the DCE RPC, just same name). It turns out that in SQL Server we can put one RPC request on the network, and right before putting the “we’re done” mark on the wire, we can actually put a “here comes another RPC” mark. This way, we can keep sending RPCs over the wire all together. Once we sent all of them, we switch to listen for responses, and we get the responses for each RPC in the order we sent them. The protocol has the appropriate infrastructure that allows us to distinguish when a response for a given RPC ends and the next starts.

This means that we get rid of the following problems:

  • We can send multiple request for statement execution in a single network round-trip, which was the original goal
  • No need to parse statements and rename parameters, because each statement goes in a different request
  • No procedure cache bloat, because regardless of the order of execution, each row has an RPC of its own
  • No limit to the number of rows that can be updated with a single batch, because we don’t depend on lots of parameters for a single RPC

There are some minor specific drawbacks to RPC-level batching, such as paying a little bit of extra cost in the server per-RPC execution, but the benefits above in general out-weigh those.

The key here is that if we can leverage the existing batch support (currently only accessible through DataTable or DataSet) there is no 2100 parameter limit for the whole batch.

roji commented 5 years ago

@divega that's some great info! Something tells me that we may also be surprised by the optimal batch size with this batching style (i.e. it may be 2 😄 ). And yeah, it's also great to know that the 2100 parameter limit goes away...

YohDeadfall commented 5 years ago

@bgribaudo, I'm looking for other ways to get the count of affected rows than adding a property or something like that for which I should pay even if it isn't used.