dotnet / reactive

The Reactive Extensions for .NET
http://reactivex.io
MIT License
6.7k stars 751 forks source link

IAsyncGrouping is inconvinient #1211

Open neuecc opened 4 years ago

neuecc commented 4 years ago

In System.Linq.Async, GroupBy returns IAsyncGrouping<TKey, TSource> but internally it is IGrouping<TKey, TSource> and cast it.

(IAsyncGrouping<TKey, TSource>)_enumerator.Current

IAsyncGrouping is very inconvenient. For example, all those who follow can't use the normal methods and have to use ***Await.

AsyncEnumerable.Range(1, 10)
    .GroupBy(x => x)
    .SelectAwait(async x => await x.CountAsync()) // can not use standard .Select operator.

This is both an inconvenience and a poor performer.

As a suggestion, why don't you get rid of IAsyncGrouping and return IGrouping? Since ToLookup implements the standard ILookup,IGrouping, I don't think it's strange if the paired GroupBy is IGrouping.

quinmars commented 4 years ago

That's right. GroupBy is eager by design. There is no point in converting the group collections (IGrouping) to something that pretends to be async.

bartdesmet commented 4 years ago

While I can see reduced ergonomics due to this decision, a GroupBy implementation doesn't have to be eager, and the interface allows for that. For example, enumeration to the next group or enumeration within an existing group can advance the underlying iteration to the point where it's able to provide the next group or value (while in the process materializing any other groups that may pop up, and bucketizing elements that belong to other groups). E.g.

Enumerable.Range(0, int.MaxValue).GroupBy(i => i % 2).Take(1).SelectMany(g => g.Take(1))

has no reason (other than the current implementation) to result in an OutOfMemoryException.

We did actually have a GroupByLazy in Ix.NET at some point, but it seems that never made it into the public Ix release, in part due to a legal concern around a patent application that was part of an undisclosed distributed database product design. The implementation of the async variant of Ix has therefore stuck to eager grouping as well, while keeping the interface such that laziness can always be added. (I can check in with legal folks to see if there are any objections against adding this functionality to Ix without users being impacted by that patent.)

To elaborate a bit more, retrieving a group from a remote queryable source and subsequently enumerating over it is a scenario where non-blocking asynchronous enumeration is warranted. The server materializes a partial, lazy result set (snapping to a version in the database, keeping a session or maintaining continuation tokens, and buffering an initial set of results to stream back to the client) for a given group, and the client enumerates over the groups (which can be canceled at any point in time or request a continuation of result yielding from the server). This has been a common design pattern (up to and including laziness and asynchrony) for a distributed database we've worked on over here, where groups are served by shards and the client (or an intermediate front-end service) merges results served up by different shards (that align with the groups requested by the user). Having an IAsyncGrouping interface allows for that. (Not to mention the ability to pass down cancellation all the way to the group level, even if fetching the group is a long-running eager process - that can periodically check for cancellation - and not really a fine-grained lazy-all-the-way-down process.)

As a side-note, it is unfortunate that IEnumerable<T> is overloaded with regards to laziness versus eagerness. (It is one of these things where a functional call-by-need lazy concept in an otherwise call-by-value eager language can lead to surprises, often "cured" carelessly with ToArray and whatnot.) In Rx, we have a history of encoding eagerness through IList<T> (e.g. Buffer versus Window). Alas, the interface for grouping has always been derived from IEnumerable<T> with no type-system level indication of eagerness, which is really an implementation detail of LINQ to Objects (only To* methods as well as OrderBy need eagerness) but IQueryable<T> query providers do not have to be eager at all (and use the same interface). Obviously an existing implementation can't change from eager to lazy (hence our GroupByLazy variant, as discussed above), because it takes away side-effects (no functional purity can be assumed when user code is in the mix).

Now, one could argue that given the eager nature of LINQ to Object's GroupBy, for better or for worse, the one in the async variant should be eager as well (and it is, and will have to remain like this, as discussed above), and hence IGrouping makes sense. (And a lazy variant could use IAsyncGrouping instead, though the synchronous Ix version would have no interface distinction, again leading to an asymmetry.) At this point, that's a breaking change though. It also poses a conundrum with regards to the queryable variant of the operator, which would have to be symmetric (otherwise adding a simple AsAsyncQueryable() step in a source would result in compile-time errors due to the change of signature of GroupBy when going from enumerable to queryable), thus preventing a query provider implementation of such an operator to support asynchronous fetching of a group. (Yes, many query providers are tied to the hip to some query language like SQL that forces "rectangularization" of the result, i.e. flattening of groups, and hence doesn't support higher-order results. But databases without that legacy - as we have built quite a few in the context of cloud - lift that restriction, as discussed above with the example of sharding.)

One more thing that's missing in all of this is support for language bindings with async query clauses. Having to use SelectAwait etc. to disambiguate overload sets clearly indicates there'd be some need for language design work to support asynchrony in query expressions and decide on binding rules to query operator methods. If such support were there (and there are no current plans to look into it as far as I can tell, but one should feel free to nag the folks on the C# LDM about it), a query expression for the sample you wrote would inevitably indicate the asynchrony of a projection following a group. And that's not unexpected given the contagious nature of async/await when converting from sync to async. It's just that a language binding for query expressions (for those who wish to use it) could ease the pain. In particular, the lightest support for async in query clauses could simply be the presence of an await expression, thus making the async modifier implicit, and the binding to an XYZAwait overload transparent. (There are some snags with this for more complex operators like GroupJoin where many lambdas are synthesized and the full matrix of 2^N overloads is not provided in the library.) In the fluent form, as you've shown above, I do fully agree that the verbosity of SelectAwait with and async lambda and an await expression inside is rather baroque.