dotnet / roslyn

The Roslyn .NET compiler provides C# and Visual Basic languages with rich code analysis APIs.
https://docs.microsoft.com/dotnet/csharp/roslyn-sdk/
MIT License
18.7k stars 3.98k forks source link

IDE0305 "collection initialization can be simplified" spoils chained LINQ calls #70833

Open Timovzl opened 7 months ago

Timovzl commented 7 months ago

(I couldn't find a more appropriate type than "bug report", but feel free to adjust this is there is a mroe correct option.)

Version Used:

.NET 8, VS 17.8.0.

Steps to Reproduce:

private static class EnumValueCache<T>
    where T : struct, Enum
{
    public static readonly ImmutableArray<T> SortedValues = Enum.GetValues<T>()
        .Order()
        .ToImmutableArray();
}

The analyzer suggests the following:

private static class EnumValueCache<T>
   where T : struct, Enum
{
   public static readonly ImmutableArray<T> SortedValues =
   [
      .. Enum.GetValues<T>()
               .Order()
,
   ];
}

Diagnostic Id:

IDE0305.

Expected Behavior:

The result should be a readability improvement.

I'm not sure if this is the perfect analysis, but here's a quick attempt: A LINQ To...() call chained after other LINQ extension calls should not be flagged by the analyzer, since that would promote reduced readability.

Actual Behavior:

The analyzer suggests using a collection expression, thus significantly reducing the readability of the code.

Although I'm aware that the analyzer can be suppressed, its behavior is very desirable for non-chained callsL

new[] { 1, 2, 3, }.ToImmutableArray() => [ 1, 2, 3, ]

CyrusNajmabadi commented 7 months ago

Outside of hte bad formatting, i find the 'after' much more readable.

Timovzl commented 7 months ago

Outside of hte bad formatting, i find the 'after' much more readable.

I'm honestly surprised. Can you help me understand your perspective?

I'll share mine as well.

Without collection expression:

Ok, that's perfectly linear.

With collection expression:

The latter seems to shuffle concerns, reducing linearity of thought. For the entire chain of operations, I need to remember the context from which we are doing those operations: we're using the spread operator and then materializing into a target-typed collection. That may seem trivial, but such mental juggling increases cognitive load.

In fact, the distinction strikes me as a less severe equivalent to this one:

// Good example: unintended main path

if (bad)
   throw;

if (also bad)
   log;
   return;

// Easy to validate mentally:
// Any concerns that popped up in my head along the way
// got handled and caused short-circuiting,
// allowing me to forget them as we kept moving forward
do quite a few lines of work;
// Bad example: indented main path

if (good)
{
   if (also good)
   {
      do quite a few lines of work;
   }
   else
   {
      // Starting to become more difficult to remember the context
      log;
      return;
   }

   // In what scenarios do we get here, exactly?
   throw;
}
CyrusNajmabadi commented 7 months ago

I'm honestly surprised. Can you help me understand your perspective?

Sure!

ImmutableArray<T> SortedValues = [.. Enum.GetValues<T>().Order()];

This reads to me as "make me an immutable array by spreading the ordered values of some enum T".

How the immutable array is created isn't relevant to me. Indeed, for all i know ToImmutableArray is actually not-optimal. I'd prefer to defer to the compiler to figure out how to this. Furthermore, it's nice and terse (though i grant that terseness is not always a virtue), and avoids pointless redundancy (like saying ImmutableArray twice).

The 'meat' here, as it were, is simply "i want the ordered enum values". The conversion to an ImmutableArray (or any other collection for that matter) is a much less interesting detail that can be lowered at the distraction level by using [ .. ]. So the above majorly increases signal, while decreasing noise.

Thanks! :)

Timovzl commented 7 months ago

@CyrusNajmabadi Cheers, that makes sense. I'm afraid it doesn't scale to longer chains, unfortunately. I should have picked an example that made that more clear.

How about something like this:

var values = input
   .Select(item => item.NullableEnumValue)
   .Where(value => value is not null)
   .Distinct()
   .Order()
   .ToImmutableArray();

With such complexity and multiline-ness, it's distracting (at least to me) to have to juggle the ongoing spread operation and collection expression in my head until I see them completed only at the very end. I'd very opt for the To...() variant here.

Then, even if the chain were short, I'd want to be consistent, so I'd personally stick with this style. That also prevents the need to rewrite the entire thing if more LINQ calls are added later on to an initially short chain..

The analyzer currently disagrees with this, requiring either (A) following its suggestion, (B) introducing an ugly suppression, or (C) silencing the analyzer and losing its perks (i.e. when there is only a call to To...()).

I'm curious what you think considering these more complex scenarios.

As an alternative, if opinions turn out to vary too much, the analyzer could be split into one for solo To...() calls vs. one for chained LINQ calls. That way, the former can be kept enabled, with the latter configured based on preference.

CyrusNajmabadi commented 7 months ago

I would just not apply the analyzer there. The analyzer is not saying you should do this. It's saying you can do this. Of you prefer the existing form in some locations, then stick with it (same with all other analyzers:-))

Timovzl commented 7 months ago

In practice, though, as soon as a solution starts accumulating messages, people stop actively considering messages. The relevant ones simply drown between the deliberately ignored ones.

Diligently suppressing the undesired messages prevents this effect.

CyrusNajmabadi commented 7 months ago

@Timovzl I don't see how that differs from the hundreds of other messages/infos we've shipped in the last years. Infos are not warnings (unless you make them as such). They're just little nuggets of infos about things you can do. In roslyn, for example, we use them in many places, but there are still 10s of thousands of them we do not. Trying to give a distinct number or knob so each invidivdual customer can try to cut each one off for the exact cases they care about just isn't viable.

jgilbert2017 commented 7 months ago

i believe the argument is that it's a signal to noise issue. i agree that the current incarnation of this message produces too much noise which in turns leads me to disable it entirely and lose any potential benefits.

Timovzl commented 7 months ago

@CyrusNajmabadi Alright, I understand that there are limits to degree of tuning that can be supported.

I think we drifted somewhat from the core issue at an interesting point. The following discovery was emerging from our discussion:

// No chained LINQ calls: collection expression is arguably better

ImmutableArray<T> values = [ 1, 2, 3, ];

var values = new[] { 1, 2, 3, }.ToImmutableArray();

// Many chained LINQ calls: collection expression is arguably worse

ImmutableArray<T> sortedValues = [.. input
   .Select(item => item.NullableEnumValue)
   .Where(value => value is not null)
   .Distinct()
   .Order()];

var values = input
   .Select(item => item.NullableEnumValue)
   .Where(value => value is not null)
   .Distinct()
   .Order()
   .ToImmutableArray();

// One or two additional LINQ calls: hard to pick a clear winner

ImmutableArray<T> sortedValues = [.. input.Select(item => item.NullableEnumValue).OrderBy(value => value)];

var values = input
   .Select(item => item.NullableEnumValue)
   .OrderBy(value => value)
   .ToImmutableArray();

If such a preference can be agreed upon roughly, then we can get the best of both worlds: the analyzer could pick up lonely To...() calls while leaving alone any chains of LINQ calls.

jgilbert2017 commented 7 months ago

@CyrusNajmabadi

How the immutable array is created isn't relevant to me. Indeed, for all i know ToImmutableArray is actually not-optimal. I'd prefer to defer to the compiler to figure out how to this.

Not really understanding what is going on under the hood with the spread operator, would it make any sense to only produce a message when there is a performance impact rather than strictly as a style preference?

(Is there any write up on the performance impact?)

CyrusNajmabadi commented 7 months ago

Not really understanding what is going on under the hood with the spread operator, would it make any sense to only produce a message when there is a performance impact rather than strictly as a style preference?

The idea behind collection expressions is that the compielr can take any route it wants to produce the best result. Which means it can do things like avoid slow methods/extensions/etc. and just use its knowledge of the types to do smart things like copy values directly into destination locations/etc. :)

CyrusNajmabadi commented 7 months ago

// Many chained LINQ calls: collection expression is arguably worse

Personally, i disagree. However, i'll see if there's anything we can do here. I far prefer the collection-expr form when looking at that. It helps me spearate out "here's the stuff that's actually doing transformational/computation/querying/linq work, versus "here's something tacked on just so i realize the list". The latter seems like noise to me, and is better expressed by just telling the compiler "i want you to do it".

But that's me :) As i said, i'll be looking into this to see if there are options here. Thanks!

jgilbert2017 commented 7 months ago

Not really understanding what is going on under the hood with the spread operator, would it make any sense to only produce a message when there is a performance impact rather than strictly as a style preference?

The idea behind collection expressions is that the compielr can take any route it wants to produce the best result. Which means it can do things like avoid slow methods/extensions/etc. and just use its knowledge of the types to do smart things like copy values directly into destination locations/etc. :)

Yes, I understand what you are saying but my question is can the code analysis message only trigger in the case where the compiler would generate something presumably superior to a simple .ToArray() style call.

CyrusNajmabadi commented 7 months ago

There is no way to no currently, and the idea is that this allows you to ahve one form, that you never need to update, that is always the most optimal.

jgilbert2017 commented 7 months ago

ok, that makes sense i guess. are there any benchmarks that highlight the performance benefit (i couldn't find any via google).

KUTlime commented 5 months ago

Some more code for a debate about styling:

Before

accumulatedRideRequestsResult = _mapper.Map<List<RideRequestModel>>(
    await _rideOfferAllocatorService
        .ExtractRideRequestsFromRideOffers(
            _mapper.Map<List<RideOffer>>(accumulatedRideOffersResult), cancellationToken)).ToList();

After analyzer suggested change

accumulatedRideRequestsResult =
[
    .. _mapper.Map<List<RideRequestModel>>(
                await _rideOfferAllocatorService
                    .ExtractRideRequestsFromRideOffers(_mapper.Map<List<RideOffer>>(accumulatedRideOffersResult), cancellationToken)),
];
CyrusNajmabadi commented 5 months ago

The indentation looks bad. But otherwise the latter form is what I would expect. It could be written in that form, or like:

accumulatedRideRequestsResult = [.. _mapper.Map<List<RideRequestModel>>(
    await _rideOfferAllocatorService
        .ExtractRideRequestsFromRideOffers(_mapper.Map<List<RideOffer>>(accumulatedRideOffersResult), cancellationToken))];
Brunni commented 3 months ago

I have some additional question here. When I look at the generated IL code for the syntax List a = [... enumerable] instead of List a = enumerable.ToList() I see quite some difference here:

The IDE suggested ide0305 to use the brackets. But this is the IL code afterwards:

List<MyClass> myClassList = new List<MyClass>();
      foreach (MyClass myClass in (IEnumerable<MyClass>) this._orderEditContext.Order.MyClass.OrderBy<MyClass, int>((Func<MyClass, int>) (instance => newSequence.IndexOf(instance.CustomerOrderId))))
        myClassList.Add(MyClass);
      order.MyClasss = myClassList;

For me this does not look efficient. The ToList implementation uses the size of the enumerable to detect the required size for the list. Am I wrong here?

Edit: Minimal Example to reproduce:

List<Order> orderList = [];

orderList = orderList.OrderBy(o => o.OrderType).ToList();
CyrusNajmabadi commented 3 months ago

When I look at the generated IL code for the syntax List a = [... enumerable] instead of List a = enumerable.ToList() I see quite some difference here:

Optimizations are under control of the compiler. @RikkiGibson Taking an IEnumerable<T> and converting to a List<T> should go through Enumerable.ToList(...) as that's the optimized runtime path (that can do things like runtime introspection of the type to do things as efficiently as possible).

RikkiGibson commented 3 months ago

Indicated codegen looks out of date to me. Current codegen looks like SharpLab

    public List<T> M<T>(IEnumerable<T> en)
    {
        List<T> list = new List<T>();
        list.AddRange(en);
        return list;
    }

which seems good enough to me. (or, let's say, not bad enough that we need to make sure to fix the codegen here ASAP)

CyrusNajmabadi commented 3 months ago

@RikkiGibson AddRange does some of hte optimizations of .ToList: https://github.com/dotnet/runtime/blob/395db7bc740906a744b1ed9531a281d86a810c01/src/libraries/System.Private.CoreLib/src/System/Collections/Generic/List.cs#L250

Though it does miss: https://github.com/dotnet/runtime/blob/395db7bc740906a744b1ed9531a281d86a810c01/src/libraries/System.Linq/src/System/Linq/ToCollection.cs#L63

We could consider just bypassing this all and calling right into new List<T>(enumerable);

But i agree. We're likely at a good place now absent some really strong data indicating a problem.

Brunni commented 3 months ago

Indicated codegen looks out of date to me. Current codegen looks like SharpLab

So you are saying, that the codegen for collection expressions was changed after dotnet version 8.0.102? It will be part of a next release?

RikkiGibson commented 3 months ago

Yeah it should be included in 8.0.200 and up.

MisinformedDNA commented 1 month ago

The indentation looks bad. But otherwise the latter form is what I would expect. It could be written in that form, or like:

accumulatedRideRequestsResult = [.. _mapper.Map<List<RideRequestModel>>(
    await _rideOfferAllocatorService
        .ExtractRideRequestsFromRideOffers(_mapper.Map<List<RideOffer>>(accumulatedRideOffersResult), cancellationToken))];

The readability on this is awful. To make the analyzer happy and get more readability, I sometimes split these expressions into two operations:

var accumulatedRideRequests = _mapper.Map<List<RideRequestModel>>(
    await _rideOfferAllocatorService
        .ExtractRideRequestsFromRideOffers(_mapper.Map<List<RideOffer>>(accumulatedRideOffersResult), cancellationToken));
accumulatedRideRequestsResult = [.. accumulatedRideRequests];

NOTE: I haven't benchmarked this.