**HavenDV** opened this issue 3 years ago (Open)
| Author | Assignees | Labels | Milestone |
|---|---|---|---|
| HavenDV | - | `api-suggestion`, `area-System.Linq`, `untriaged` | - |
FWIW, I believe it would be extremely difficult to provide this information reliably. LINQ is an abstraction around arbitrary code, and the algorithmic complexity of any LINQ method is going to depend on the algorithmic complexity of the underlying enumeration.
Take `Enumerable.Any<T>(IEnumerable<T>)` as an example. The complexity is O(1) if it wraps a known collection type. But it's O(?) if it wraps a generic `IEnumerable<T>`, as it depends on the implementation of `IEnumerator<T>.MoveNext()`. For most collection types this should be fairly fast, but what if you have the following?
```csharp
IEnumerable<T> theEnumerable = GetEnumerable();
IEnumerable<T> reversed = theEnumerable.Reverse();
bool isNonEmpty = reversed.Any(); // what's the runtime of this method?

private IEnumerable<T> GetEnumerable()
{
    while (condition)
    {
        yield return /* ... */;
    }
}
```
This could be O(n) (because the Reverse operation must be eagerly enumerated in its entirety). It could be O(n log n) (if the call to `GetEnumerable` returned an enumerable wrapped within an OrderBy clause). It could be worse. And so on.
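To make that cost visible, here is a small sketch (the `Source` iterator and the element count are arbitrary, made up for illustration) that counts how many elements `Any()` actually pulls from a lazy source, with and without an intervening `Reverse()`:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int pulls = 0;

// Hypothetical lazy source; `pulls` counts how many elements it hands out.
IEnumerable<int> Source(int count)
{
    for (int i = 0; i < count; i++)
    {
        pulls++;
        yield return i;
    }
}

// Any() directly on the lazy source stops after the first element: O(1) pulls.
pulls = 0;
bool anyDirect = Source(1000).Any();
int directPulls = pulls;

// Any() after Reverse() has to buffer the entire source before it can yield
// the first (i.e. last) element, so the same call becomes O(n).
pulls = 0;
bool anyReversed = Source(1000).Reverse().Any();
int reversedPulls = pulls;

Console.WriteLine($"direct: {directPulls}, reversed: {reversedPulls}");
```

On current .NET implementations this prints `direct: 1, reversed: 1000`: the same `Any()` call does a thousand times more work depending only on what it wraps.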
I think there's definite value in documenting the algorithmic complexity of operations on other collection types: `List<T>`, `HashSet<T>`, `Dictionary<TKey, TValue>`, `string`, `T[]`, `MemoryExtensions`, etc. But those are known quantities because we fully control the code being executed; or at least we control enough of the code being executed that we can define the complexity in terms of the parameters passed in. But when you have something like LINQ that's fundamentally an abstraction, I hesitate to make anything that could be construed as a guarantee.
This might make more sense as a tooling feature. Language services could know whether the `IEnumerable` being called is implemented in the framework or not.
For LINQ, I suggest adding a list of classes for which optimization is supported. An example for the `Contains` method:

- For `ICollection<T>`, the complexity is the complexity of the original collection.
- For other cases, the complexity is O(n).

Alternatively, add a separate extension method for the `ICollection<T>` type with a separate description, and specify the complexity for the original method.
It's much more complex than just the original data type. LINQ operators flow information from one to the next, and the sequence of operators can impact the algorithmic complexity of the operation. For example, OrderBy might generally be O(n log n), but it can end up being O(n), for example if you have OrderBy(...).First(...). Cataloguing all possible combinations, in concert with various inputs and their characteristics, in concert with all the various modes of consumption, would make trying to document all possible algorithmic complexities an uphill battle, as Levi says, and also something difficult to reason about. I do not believe it's worthwhile.
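The `OrderBy(...).First(...)` case can be observed by counting comparisons. This sketch (the counting comparer is an ad-hoc instrument; the exact counts depend on the runtime's sort and on whether the optimization is present in your .NET version) contrasts a full sort with taking only the first ordered element:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Ad-hoc counting comparer to observe how much work an OrderBy pipeline does.
int comparisons = 0;
var counting = Comparer<int>.Create((a, b) => { comparisons++; return a.CompareTo(b); });

var rng = new Random(42);
int[] data = Enumerable.Range(0, 1000).Select(_ => rng.Next()).ToArray();

// Full sort: roughly O(n log n) comparisons.
comparisons = 0;
int[] sorted = data.OrderBy(x => x, counting).ToArray();
int sortComparisons = comparisons;

// OrderBy(...).First(): current .NET implementations can satisfy this with a
// single minimum scan, so the comparison count can drop to O(n).
comparisons = 0;
int first = data.OrderBy(x => x, counting).First();
int firstComparisons = comparisons;

Console.WriteLine($"full sort: {sortComparisons}, First(): {firstComparisons}");
```

The point of the comment stands either way: the observed cost is a property of the whole operator chain, not of `OrderBy` alone.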
I believe a rough outline, similar to how Python does it in their wiki, would be very useful. It doesn't need to attempt to be complete, and could still list some good-to-know corner cases (e.g. Contains on sets, or the OrderBy.First optimization).
I don't think this should be on the LINQ method's own doc page, simply because then there would be an expectation that it's both complete and accurate, but maybe we can collect this and place it as a separate aside article in the docs.
I think it would be good to provide a best-effort doc for the operations. It doesn't need to be 100% precise; we could leave room for nuance by saying there are optimizations depending on the operators chained or the collection type. I don't think withholding documentation just because it can't be perfect is useful...
Does it really help? For example, the documentation for `Enumerable.Contains` already states that it delegates to `ICollection<T>.Contains` if the source implements that interface, and otherwise enumerates. OrderBy performs a sort, and any reasonable implementation is going to have average O(n log n) complexity in the common case (if we're not trying to document the complete matrix). I'm just not clear what the effort yields.
Side note: my proposal is aimed rather at giving the "average" developer a better understanding of what they are using, because I'm not sure the average developer would go through the documentation for the sake of optimization. That's why I mean specifically the built-in XML documentation. It's also sometimes useful for quickly brushing up on knowledge.
I would read this thread: https://www.reddit.com/r/dotnet/comments/npno3k/new_linq_extensions_in_net_6_and_benchmarks/. It has some discussion around the big-O of some of these methods. FWIW, this doesn't belong in the doc comments IMO; a table of the various operations, with the optimizations made for specific collection types, would be enough.
Without studying the LINQ source code and the documentation for specific methods, it is sometimes difficult to quickly grasp the algorithmic complexity of a specific method. I think documenting it would add clarity during development. I also admit this may be a very small problem, and you can close this issue.
Example: I had to dig into the source code to find out the LINQ complexity of the `Contains` method for the `HashSet` class. It could be either O(n) or O(1), depending on whether LINQ contains a check for that specific class.