dotnet / efcore

EF Core is a modern object-database mapper for .NET. It supports LINQ queries, change tracking, updates, and schema migrations.
https://docs.microsoft.com/ef/
MIT License

IN() list queries are not parameterized, causing increased SQL Server CPU usage #13617

Closed NickCraver closed 1 year ago

NickCraver commented 5 years ago

Currently, EF Core is not parameterizing IN(...) queries created from .Contains() (and maybe other cases). This has a very detrimental impact on SQL Server itself because:

  1. The query hash is different for each distinct set of IDs passed in.
  2. SQL Server needs to calculate a plan for each of these distinct combinations (CPU hit).
  3. The query plans created occupy an entry in SQL Server's plan cache (unless ad-hoc behavior is on), fighting for resources on other plans and cascading CPU issues by re-calculations due to evictions.

Note: when SQL Server has memory pressure, plan cache is the first thing to empty. So this has a profound impact at scale, doubly so when things have gone sideways.

Steps to reproduce

Here's a reduced down version of the problem:

var ids = Enumerable.Range(1, 1000).ToList();
var throwAway = context.Users.Where(u => ids.Contains(u.Id)).ToList();

This results in the following:

SELECT [u].[Id], [u].[AcceptRateAccepted], [u].[AcceptRateAsked], [u].[AccountId], [u].[AnswerCount], [u].[BronzeBadges], [u].[CreationDate], [u].[DaysVisitedConsecutive], [u].[DaysVisitedTotal], [u].[DisplayName], [u].[Email], [u].[Flags], [u].[GoldBadges], [u].[HasAboutMeExcerpt], [u].[HasReplies], [u].[IsVeteran], [u].[JobSearchStatus], [u].[LastAccessDate], [u].[LastDailySiteAccessDate], [u].[LastEmailDate], [u].[LastLoginDate], [u].[LastLoginIP], [u].[LastModifiedDate], [u].[Location], [u].[OptInEmail], [u].[PreferencesRaw], [u].[ProfileImageUrl], [u].[QuestionCount], [u].[RealName], [u].[Reputation], [u].[ReputationMonth], [u].[ReputationQuarter], [u].[ReputationSinceLastCheck], [u].[ReputationToday], [u].[ReputationWeek], [u].[ReputationYear], [u].[SignupStarted], [u].[SilverBadges], [u].[TeamId], [u].[TeamName], [u].[TimedPenaltyDate], [u].[Title], [u].[UserTypeId], [u].[WebsiteUrl]
FROM [Users] AS [u]
WHERE [u].[Id] IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 
418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610, 611, 612, 613, 614, 615, 616, 617, 618, 619, 620, 621, 622, 623, 624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635, 636, 637, 638, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649, 650, 651, 652, 653, 654, 655, 656, 657, 658, 659, 660, 661, 662, 663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 673, 674, 675, 676, 677, 678, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701, 702, 703, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714, 715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 725, 726, 727, 728, 729, 730, 731, 732, 733, 734, 735, 736, 737, 738, 739, 740, 741, 742, 743, 744, 745, 746, 747, 748, 749, 750, 751, 752, 753, 754, 755, 756, 757, 758, 759, 760, 761, 762, 763, 764, 765, 766, 767, 768, 769, 770, 771, 772, 773, 774, 775, 776, 777, 778, 779, 780, 781, 782, 783, 784, 785, 786, 787, 788, 789, 790, 791, 792, 793, 794, 795, 796, 797, 798, 799, 800, 801, 802, 803, 804, 805, 806, 807, 808, 809, 810, 811, 812, 813, 814, 815, 816, 817, 
818, 819, 820, 821, 822, 823, 824, 825, 826, 827, 828, 829, 830, 831, 832, 833, 834, 835, 836, 837, 838, 839, 840, 841, 842, 843, 844, 845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 855, 856, 857, 858, 859, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870, 871, 872, 873, 874, 875, 876, 877, 878, 879, 880, 881, 882, 883, 884, 885, 886, 887, 888, 889, 890, 891, 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903, 904, 905, 906, 907, 908, 909, 910, 911, 912, 913, 914, 915, 916, 917, 918, 919, 920, 921, 922, 923, 924, 925, 926, 927, 928, 929, 930, 931, 932, 933, 934, 935, 936, 937, 938, 939, 940, 941, 942, 943, 944, 945, 946, 947, 948, 949, 950, 951, 952, 953, 954, 955, 956, 957, 958, 959, 960, 961, 962, 963, 964, 965, 966, 967, 968, 969, 970, 971, 972, 973, 974, 975, 976, 977, 978, 979, 980, 981, 982, 983, 984, 985, 986, 987, 988, 989, 990, 991, 992, 993, 994, 995, 996, 997, 998, 999, 1000);

Further technical details

EF Core version: 2.1.4
Database Provider: Microsoft.EntityFrameworkCore.SqlServer
Operating system: Windows 2016/Windows 10
IDE: Visual Studio 2017 15.9

Proposal

EF Core should parameterize here. Instead of IN (1, 2, 3, ...) we should see IN (@__ids1, @__ids2, @__ids3, ...) or similar. This would allow query plan cache to be shared. For example if we ran this 1,000 times to fetch 1,000,000 users in batches, we'd have 1 plan in cache, whereas today we have 1,000 plans. Let's say a user gets removed (or added!) on page 2 of 1,000...today we'd calculate and cache another 999 plans next run.

To further address the cardinality problem, an approach similar to what Dapper does would be at least a starting point: generate only lists of certain sizes, let's say 5, 10, 50, 100, 500, and 1000. Here's an example with 3 values padded into the smallest bucket of 5 parameters:

SELECT [u].[Id], [u].[AcceptRateAccepted], [u].[AcceptRateAsked], [u].[AccountId], [u].[AnswerCount], [u].[BronzeBadges], [u].[CreationDate], [u].[DaysVisitedConsecutive], [u].[DaysVisitedTotal], [u].[DisplayName], [u].[Email], [u].[Flags], [u].[GoldBadges], [u].[HasAboutMeExcerpt], [u].[HasReplies], [u].[IsVeteran], [u].[JobSearchStatus], [u].[LastAccessDate], [u].[LastDailySiteAccessDate], [u].[LastEmailDate], [u].[LastLoginDate], [u].[LastLoginIP], [u].[LastModifiedDate], [u].[Location], [u].[OptInEmail], [u].[PreferencesRaw], [u].[ProfileImageUrl], [u].[QuestionCount], [u].[RealName], [u].[Reputation], [u].[ReputationMonth], [u].[ReputationQuarter], [u].[ReputationSinceLastCheck], [u].[ReputationToday], [u].[ReputationWeek], [u].[ReputationYear], [u].[SignupStarted], [u].[SilverBadges], [u].[TeamId], [u].[TeamName], [u].[TimedPenaltyDate], [u].[Title], [u].[UserTypeId], [u].[WebsiteUrl]
FROM [Users] AS [u]
WHERE [u].[Id] IN (@ids1, @ids2, @ids3, @ids4, @ids5);

The 5 parameters are there so that 1-5 values all use the same plan. Any list shorter than the bucket length repeats the last value (don't use null!). In this case, say our values are 1, 2, and 3; our parameterization would be @ids1 = 1, @ids2 = 2, @ids3 = 3, @ids4 = 3, @ids5 = 3.

This fetches the users we want, is friendly to the cache, and lessens the amount of generated permutations for all layers.
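The padding rule above can be sketched in a few lines (an illustration only, with hypothetical names; this is not EF Core or Dapper code):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Bucket sizes from the example above (the exact sizes are illustrative).
int[] buckets = { 5, 10, 50, 100, 500, 1000 };

// Pads a value list up to the next bucket size by repeating the last
// element (never null), so that 1-5 values all share one 5-parameter plan.
List<T> PadToBucket<T>(IReadOnlyList<T> values)
{
    if (values.Count == 0) throw new ArgumentException("need at least one value");
    int bucket = buckets.First(b => b >= values.Count);
    var padded = new List<T>(values);
    while (padded.Count < bucket)
        padded.Add(padded[padded.Count - 1]);
    return padded;
}

var padded = PadToBucket(new[] { 1, 2, 3 });
Console.WriteLine(string.Join(", ", padded)); // 1, 2, 3, 3, 3
```

For values 1, 2, 3 this yields 1, 2, 3, 3, 3, which binds to @ids1 through @ids5 in the SQL above.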

To put this in perspective, at Stack Overflow scale we're generating millions of one-time-use plans needlessly in EF Core (the parameterization in Linq2Sql lowered this to only cardinality permutations). We alleviate the cache issue by enabling ad-hoc query mode on our SQL Servers, but that doesn't lessen the CPU use from all the query hashes involved except in the (very rare) reuse case of the exact same list.

This problem is dangerous because it's also hard to see. If you're looking at the plan cache, you're not going to see it by any sort of impact query analysis. Each query, hash, and plan is different. There's no sane way to group them. It's death by a thousand cuts you can't see. I'm currently finding and killing as many of these in our app as I can, but we should fix this at the source for everyone.

ErikEJ commented 5 years ago

Related issue: https://github.com/aspnet/EntityFrameworkCore/issues/12777

kroymann commented 5 years ago

I wrote an extension to EF6 to handle this exact problem for our product, and apparently it is extremely similar to what Nick is proposing above. By replacing this where clause: .Where(entity => myEnumerable.Contains(entity.prop)) with this extension method: .BatchedWhereKeyInValues(entity => entity.prop, myEnumerables, batchSize: 100), my solution breaks the list of values into batches and then produces a properly parameterized query that generates the desired SQL.

For example, this expression: query.BatchedWhereKeyInValues(q => q.Id, values: {1,2,5,7,8,9}, batchSize: 3)

would effectively become the following two queryables, where the numbers are supplied as variables that allow EF to parameterize the query rather than embed them as constants: query.Where(q => q.Id == 1 || q.Id == 2 || q.Id == 5) query.Where(q => q.Id == 7 || q.Id == 8 || q.Id == 9)

EF then converts those LINQ expressions into this SQL expression: WHERE Id IN (@p__linq__0, @p__linq__1, @p__linq__2)

Check out the source here: https://gist.github.com/kroymann/e57b3b4f30e6056a3465dbf118e5f13d

StasPerekrestov commented 5 years ago

@kroymann Works like a charm! Big Thanks!

public static class QueryableExtension<TQuery>
    {
        internal static IEnumerable<IQueryable<TQuery>> WhereIn<TKey>(IQueryable<TQuery> queryable,
            Expression<Func<TQuery, TKey>> keySelector, IEnumerable<TKey> values, int batchSize)
        {
            List<TKey> distinctValues = values.Distinct().ToList();
            int lastBatchSize = distinctValues.Count % batchSize;
            if (lastBatchSize != 0)
            {
                distinctValues.AddRange(Enumerable.Repeat(distinctValues.Last(), batchSize - lastBatchSize));
            }

            int count = distinctValues.Count;
            for (int i = 0; i < count; i += batchSize)
            {
                var body = distinctValues
                    .Skip(i)
                    .Take(batchSize)
                    .Select(v =>
                    {
                        // Create an expression that captures the variable so EF can turn this into a parameterized SQL query
                        Expression<Func<TKey>> valueAsExpression = () => v;
                        return Expression.Equal(keySelector.Body, valueAsExpression.Body);
                    })
                    .Aggregate((a, b) => Expression.OrElse(a, b));

                var whereClause = Expression.Lambda<Func<TQuery, bool>>(body, keySelector.Parameters);
                yield return queryable.Where(whereClause);
            }
        }

        // Non-batching variant: one parameter per distinct value in a single query
        internal static IQueryable<TQuery> WhereIn<TKey>(IQueryable<TQuery> queryable,
            Expression<Func<TQuery, TKey>> keySelector, IEnumerable<TKey> values)
        {
            TKey[] distinctValues = values.Distinct().ToArray();
            if (distinctValues.Length == 0)
            {
                return Enumerable.Empty<TQuery>().AsQueryable();
            }

            var body = distinctValues
                .Select(v =>
                {
                    // Create an expression that captures the variable so EF can turn this into a parameterized SQL query
                    Expression<Func<TKey>> valueAsExpression = () => v;
                    return Expression.Equal(keySelector.Body, valueAsExpression.Body);
                })
                .Aggregate((a, b) => Expression.OrElse(a, b));

            var whereClause = Expression.Lambda<Func<TQuery, bool>>(body, keySelector.Parameters);
            return queryable.Where(whereClause);
        }
    }

Usage:

int[] a = Enumerable.Range(1, 10).ToArray();
var queries = QueryableExtension<User>.WhereIn(dbContext.Users, t => t.Id, a, 5);
foreach (var queryable in queries)
{
    _ = queryable.ToArray();
}

var everything = QueryableExtension<User>.WhereIn(dbContext.Users, t => t.Id, a);
_ = everything.ToArray(); 

SonicGD commented 5 years ago

Also, such queries don't aggregate well in logging systems like Application Insights, because every query has unique SQL.

divega commented 5 years ago

I just wanted to share here some experiments I did some time ago with different alternative approaches to Enumerable.Contains using existing EF Core APIs:

https://github.com/divega/ContainsOptimization/

The goal was both to find possible workarounds, and to explore how we could deal with Contains in the future.

I timed the initial construction and first execution of the query with different approaches. This isn't a representative benchmark because there is no data and it doesn't measure the impact of caching, but I think it is still interesting to see the perf sensitivity to the number of parameters.

The code tests 4 approaches:

Test("standard Contains", context.People.Where(p => (ids.Contains(p.Id))));

Test("Parameter rewrite", context.People.In(ids, p => p.Id));

Test("Split function", 
    context.People.FromSql(
        $"select * from People p where p.Id in(select value from string_split({string.Join(",", ids)}, ','))"));

Test("table-valued parameter", 
    context.People.FromSql(
        $"select * from People p where p.Id in(select * from {context.CreateTableValuedParameter(ids, "@p0")})"));

Standard Contains

This is included as a baseline and is what happens today when you call Contains with a list: the elements of the list get in-lined as constants in the SQL as literals and the SQL cannot be reused.

Parameter rewrite

This approach tries to rewrite the expression to mimic what the compiler would produce for a query like this:

var p1 = 1;
var p2 = 2; 
var p3 = 3;
context.People.Where(p => (new List<int> {p1, p2, p3}.Contains(p.Id))).ToList();

It also "bucketizes" the list, meaning that it only produces lists of specific lengths (powers of 2), repeating the last element as necessary, with the goal of favoring caching.

This approach only works within the limits of 2100 parameters in SQL Server, and sp_executesql takes a couple of parameters, so the size of the last possible bucket is 2098.

Overall this seems to be the most expensive approach using EF Core, at least on initial creation and execution.
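The bucketizing rule described above (round up to a power of two, capped by the parameter limit) can be sketched as follows; this is a minimal illustration, not the code from the linked repository:

```csharp
using System;

// Rounds a value count up to the next power of two so that many list
// lengths share one cached plan; capped at 2098 because SQL Server allows
// 2100 parameters and sp_executesql itself consumes two of them.
int BucketSize(int count)
{
    int size = 1;
    while (size < count) size <<= 1;
    return Math.Min(size, 2098);
}

Console.WriteLine(BucketSize(3));    // 4
Console.WriteLine(BucketSize(100));  // 128
Console.WriteLine(BucketSize(3000)); // 2098
```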

Split function

This approach was mentioned by @NickCraver and @mgravell as something that Dapper can leverage. I implemented it using FromSql. It is interesting because it leads to just one (potentially very long) string parameter and it seems to perform very well (at least in the first query), but the STRING_SPLIT function is only available in SQL Server since version 2016.

Table-valued parameter

For this I took advantage of the ability to create table-valued parameters that contain an IEnumerable<DbDataRecord> (I used a list rather than a streaming IEnumerable, but not sure that matters in this case) and pass them as a DbParameter in FromSql. I also added a hack to define the table type for the parameter if it isn't defined. Overall this seems to be the second most lightweight approach after the split function.
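The table-valued-parameter construction can be sketched like this (the table type name dbo.IntList and the binding code are assumptions for illustration, not the repository's actual helper; the SqlClient binding needs a live connection, so it is shown in comments):

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

// Packs IDs into a single-column DataTable matching a hypothetical SQL Server
// table type: CREATE TYPE dbo.IntList AS TABLE (Value int NOT NULL)
DataTable ToIntTable(IEnumerable<int> ids)
{
    var table = new DataTable();
    table.Columns.Add("Value", typeof(int));
    foreach (var id in ids)
        table.Rows.Add(id);
    return table;
}

var tvp = ToIntTable(Enumerable.Range(1, 5));
Console.WriteLine(tvp.Rows.Count); // 5

// Binding sketch (Microsoft.Data.SqlClient):
//   var p = new SqlParameter("@p0", SqlDbType.Structured)
//   {
//       TypeName = "dbo.IntList",
//       Value = tvp
//   };
//   context.People.FromSql("select * from People p where p.Id in (select Value from @p0)", p);
```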

ErikEJ commented 5 years ago

@divega Just stumbled upon this: https://stackoverflow.com/questions/25228362/how-to-avoid-query-plan-re-compilation-when-using-ienumerable-contains-in-entity

ErikEJ commented 5 years ago

And this: https://vladmihalcea.com/improve-statement-caching-efficiency-in-clause-parameter-padding/

IGx89 commented 5 years ago

Here's an alternative solution I discovered thanks to a semi-related SO post by @davidbaxterbrowne:

var records = await _context.Items
    .AsNoTracking()
    .FromSql(
        "SELECT * FROM dbo.Items WHERE Id IN (SELECT value FROM OPENJSON({0}))",
        JsonConvert.SerializeObject(itemIds.ToList()))
    .ToListAsync();

It's specific to SQL Server 2017+ (or Azure SQL Server), and in my testing with a list of 50,000 IDs against a remote Azure SQL database it was 5x faster than a standard batching solution (dividing into lists of 1,000 IDs).

RyanThomas73 commented 4 years ago

I recently put together an implementation for my day job to address this issue (targeting EF Core v3.1.3)

I would be willing to put together a PR for this issue if the EF Core team would confirm how they want me to address a few of the limitations I had to work around.

  1. Adding additional parameters at the time the IRelationalCommand is generated:

The IN clause parameter expansion happens at: RelationalCommandCache.GetRelationalCommand(..) -> ParameterValueBasedSelectExpressionOptimizer.Optimize(..) -> InExpressionValuesExpandingExpressionVisitor.Visit(..)

https://github.com/dotnet/efcore/blob/6f4f078e8f92ee8e2ca497fa4b83d4ecebdd643b/src/EFCore.Relational/Query/Internal/ParameterValueBasedSelectExpressionOptimizer.cs#L222

One of the limitations I had to work around is the parameters dictionary being IReadOnlyDictionary<string, object>. In order to parameterize the IN clause I need the ability to add parameters into this dictionary. I could work around this in a solution PR by casting the dictionary back to IDictionary<> or by changing the call signatures to pass through the non read-only dictionary interface.

How would the EF Core team like this to be addressed?

  2. Controlling batch size. Currently I handle this with two solutions:

     2a. I use an IDbOptionsExtension to control enabling/disabling the IN clause parameterization and to supply the default batch size to pad the parameter count to.

     What dependency objects/names would the EF Core team want me to use for these settings?

     2b. I provided my team a .WhereIn(predicate, batchSize) extension method to use. This extension method behaves identically to a normal .Where(predicate), and the query method translation process extracts the batch size and stores it. This batch size overrides the default value from the options extension when used.

     Would the EF Core team want me to include this extension method in my PR?

ErikEJ commented 4 years ago

I have updated one of @divega 's proposals above for 3.1: https://erikej.github.io/efcore/sqlserver/2020/03/30/ef-core-cache-pollution.html

GSPP commented 4 years ago

It is good to see that "advanced" solutions have been proposed here. Is it feasible to create a rather simple but quick fix first? Simply making the list a bunch of parameters (e.g. @p0, @p1, ...) would be much better than the literals. That way there's one plan for each list count in contrast to one plan for each set of constants.
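The simple fix described here amounts to emitting one placeholder per element; a minimal sketch of the SQL-text side (names hypothetical, and the values would still need to be bound as DbParameters):

```csharp
using System;
using System.Linq;

// Expands "col IN (...)" into one placeholder per element, so every
// 3-element list produces identical SQL text and reuses one cached plan,
// regardless of the actual values bound to @p0, @p1, ...
string BuildInClause(string column, int count) =>
    $"{column} IN ({string.Join(", ", Enumerable.Range(0, count).Select(i => $"@p{i}"))})";

Console.WriteLine(BuildInClause("[u].[Id]", 3)); // [u].[Id] IN (@p0, @p1, @p2)
```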

thomasdc commented 4 years ago

Might be interesting for others, here's how we solved the problem by using LINQKit.Core:

var predicate = PredicateBuilder.New<User>();
foreach (var id in ids)
{
    predicate = predicate.Or(user => user.Id == id);
}

var throwAway = context.Users.Where(predicate).ToList();

This will generate a parameterized query that looks like this:

SELECT ...
FROM   [Users] AS [u]
WHERE  [u].[Id] = @id1 
    OR [u].[Id] = @id2
    OR [u].[Id] = @id3
    ...
cjstevenson commented 4 years ago

So could quoted values and backslashes lead to SQL injection attacks? E.g. a value of legit_value\\'); SELECT * FROM some_table limit 10; --.

I'm surprised that this is not treated as a security issue.

smitpatel commented 4 years ago

@cjstevenson - The value you provided as an example does not cause any SQL injection. EF Core escapes quotes properly in the SQL. Generated SQL for your example:

      SELECT [b].[Id], [b].[Value]
      FROM [Blogs] AS [b]
      WHERE [b].[Value] IN (N'legit_value\''); SELECT * FROM some_table limit 10; -- ')

This is not a security issue. This issue has nothing to do with values and SQL injection, please file a new issue with repro steps if you are seeing a bug. Please refer to security policy if you are submitting a report for security issue.

cjstevenson commented 4 years ago

Thanks for pointing me to the security policy.

stevendarby commented 3 years ago

@ajcvickers as someone interested in this issue and your plans for 6.0, I was wondering what the categories mean in your EF Core Backlog. This has gone from Proposed to Backlog - does that mean it was proposed for the backlog and now it's in the backlog? And then next step would be to consider moving to Committed Next Release?

roji commented 3 years ago

@snappyfoo we're still working out the mechanics in our planning process and the Github project board. The important thing is that this issue has the consider-for-next-release label, which means it's definitely on the table for 6.0.

stevendarby commented 3 years ago

Been investigating solutions in the meantime and found that the string_split function idea from above can be separated out into its own queryable for flexible reuse. For example...

Add keyless entity to your model:

public class StringSplitResult
{
    public string Value { get; set; }
}

Some methods like this on your context:

public IQueryable<string> AsQuery(IEnumerable<string> values) =>
    SplitString(string.Join(",", values));

public IQueryable<string> SplitString(string values, string separator = ",") =>
    Set<StringSplitResult>()
        .FromSqlInterpolated($"SELECT Value FROM string_split({values}, {separator})")
        .Select(x => x.Value);

You could do something like this:

var codes = new[] {"CODE1", "CODE2"};
var item = context.Set<Entity>()
    .Where(entity => context.AsQuery(codes).Contains(entity.Code))
    .FirstOrDefault();

Which produces SQL:

Microsoft.EntityFrameworkCore.Database.Command: Information: Executed DbCommand (21ms) [Parameters=[p0='CODE1,CODE2' (Size = 4000), p1=',' (Size = 4000)], CommandType='Text', CommandTimeout='30']
SELECT TOP(1) [e].[Id], [e].[Code]
FROM [Entities] AS [e]
WHERE EXISTS (
    SELECT 1
    FROM (
        SELECT Value FROM string_split(@p0, @p1)
    ) AS [s]
    WHERE ([s].[Value] = [e].[Code]) OR ([s].[Value] IS NULL AND [e].[Code] IS NULL))

stevendarby commented 3 years ago

> @cjstevenson - The value you provided as example does not cause any SQL injection. EF Core escapes quotes properly in the SQL. Generated SQL for example
>
>       SELECT [b].[Id], [b].[Value]
>       FROM [Blogs] AS [b]
>       WHERE [b].[Value] IN (N'legit_value\''); SELECT * FROM some_table limit 10; -- ')
>
> This is not a security issue. This issue has nothing to do with values and SQL injection, please file a new issue with repro steps if you are seeing a bug. Please refer to security policy if you are submitting a report for security issue.

Is it a security issue around sensitive values, though? Sensitive data logging is not enabled by default, and as a result SQL parameters are logged as '?' instead of the actual value. One might assume that the values of a list used in a query with Contains would have this same protection. But because they're hardcoded into the SQL, they're logged.

The documentation for EnableSensitiveDataLogging() says "You should only enable this flag if you have the appropriate security measures in place based on the sensitivity of this data."

It may not be clear that the appropriate security measures might not be in place even with this flag disabled. This might just be a documentation issue. But it also seems to me that introducing parameterization for Contains / IN would allow those with very strict security policies around sensitive data to be able to log more.

mrlife commented 3 years ago

@roji Can this be considered for the 6.0 release?

roji commented 3 years ago

@mrlife unfortunately not - this is quite a non-trivial optimization, and we're now in a feature freeze, stabilizing the release. I'm very hopeful we'll be able to tackle this for 7.0.

ErikEJ commented 3 years ago

@mrlife there are multiple workarounds described here

mrlife commented 3 years ago

@ErikEJ Thank you, is there one that you like or is generally recommended?

ErikEJ commented 3 years ago

https://gist.github.com/ErikEJ/6ab62e8b9c226ecacf02a5e5713ff7bd

This one, with great comments and suggestions @mrlife

plittlewood-rpt commented 2 years ago

I have an API that switches DbContext from request to request based on the customer accessing it (this is because we use single tenant databases). We are seeing on occasion, after periods of high traffic, that the expressions compiled after using the .Contains() operator use old values, as though a cache is being exhausted.

To give you some context:

Users might call GET /myEndpoint?ownerId=ABC&ownerId=DEF

The query parameters get translated into SQL via LINQ using something similar to:

       if (query.OwnerIds!= null && query.OwnerIds.Any())
       {
           predicate = predicate.And(p => query.OwnerIds.Contains(p.OwnerId) || query.OwnerIds.Contains(p.OwnerId2) || query.OwnerIds.Contains(p.OwnerId3));
       }

The predicate then gets used when calling the database context ie: dbContext.Data.Where(predicate)

This correctly gets converted to a MySQL statement using IN () as I'd expect. If we take the URL example above (GET /myEndpoint?ownerId=ABC&ownerId=DEF) however, the query that gets executed against the database actually uses values from a previous query, so it might run WHERE ownerId IN ('GHI', 'JKL') OR ownerId2 IN ('GHI', 'JKL') OR ownerId3 IN ('GHI', 'JKL').

This only seems to occur after a period of high traffic has hit the API. Restarting the container that the API runs in resolves the issue, until it eventually happens again.

I can rewrite the LINQ to chain OR conditions together, which then results in a parameterized query but rather horrible SQL. Can anybody suggest anything that might be causing the underlying problem? I'm not bothered if the IN () is not parameterized IF I can explain and resolve what's going on with old values being used.

ErikEJ commented 2 years ago

@plittlewood-rpt which provider? Could be a provider Bug.

plittlewood-rpt commented 2 years ago

Hi @ErikEJ - we are using Pomelo with MySQL and I haven't been able to find anything relating to the problem we're having. Annoyingly I can't replicate this problem at all still no matter how much traffic I fire at a local development copy of the API.

ajcvickers commented 2 years ago

@plittlewood-rpt It doesn't seem like your error is directly related to the feature that this issue is tracking. Please open a new issue to continue the discussion.

We are likely to need to be able to reproduce what you are seeing to be able to help. However, please make sure that no DbContext instance is being used concurrently by multiple threads, and make sure that all async queries are being awaited before starting a new query or performing any other action on the context.

yv989c commented 2 years ago

Hello everyone! This is my take on providing a solution to this problem. It's now even more flexible by allowing you to compose a complex type.

I just added some benchmarks showing great results.

Please give it a try. I'm looking forward to your feedback.

Links:

roji commented 1 year ago

Here's another approach for solving this problem (link):

DECLARE @pSearchOptions NVARCHAR(4000) = N'[1,2,3,4]'

SELECT *
FROM products
INNER JOIN OPENJSON(@pSearchOptions) AS productTypes
 ON products.productTypeID = productTypes.value

This sends a single string parameter containing a JSON array of the IDs; in SQL Server, OPENJSON is used to parse that into a (temporary) table, which we then join with the target table.

We'd need to benchmark this approach to make sure that the temporary table doesn't create perf problems (as in our old SQL Server update pipeline). This would also require client-side parameter modification in our query pipeline, which we don't currently support. But other than that it seems like a pretty ideal solution.

yv989c commented 1 year ago

> Here's another approach for solving this problem (link):
>
> DECLARE @pSearchOptions NVARCHAR(4000) = N'[1,2,3,4]'
>
> SELECT *
> FROM products
> INNER JOIN OPENJSON(@pSearchOptions) AS productTypes
>  ON products.productTypeID = productTypes.value
>
> This sends a single string parameter containing a JSON array of the IDs; in SQL Server, OPENJSON is used to parse that into a (temporary) table, which we then join with the target table.
>
> We'd need to benchmark this approach to make sure that the temporary table doesn't create perf problems (as in our old SQL Server update pipeline). This would also require client-side parameter modification in our query pipeline, which we don't currently support. But other than that it seems like a pretty ideal solution.

Hi @roji. I recently added that support to QueryableValues for EF6 with great results. When OPENJSON is not available I fallback to XML methods. I have plans to add the same support to QueryableValues for EF Core.

I'm curious what approach and strategy EF will end up implementing.

roji commented 1 year ago

NOTE: See additional benchmarking below with SqlClient EnableOptimizedParameterBinding

Did a quick perf investigation of parameterized vs. non-parameterized WHERE x IN (...) with lots of values.

Key notes and takeaways:

To summarize, I hope we can end up with a solution that doesn't use the SQL IN construct at all, thereby avoiding all of this (see the INNER JOIN-based solutions above).

| Method | NumValues | Database | IsFound | IsParameterized | Mean | Error | StdDev | Median |
|--------|-----------|----------|---------|-----------------|------------:|----------:|----------:|------------:|
| Test | 1 | PostgreSQL | False | False | 304.7 us | 8.26 us | 24.36 us | 306.6 us |
| Test | 1 | PostgreSQL | False | True | 304.9 us | 8.67 us | 25.56 us | 311.0 us |
| Test | 1 | PostgreSQL | True | False | 311.0 us | 8.43 us | 24.71 us | 314.0 us |
| Test | 1 | PostgreSQL | True | True | 314.7 us | 9.22 us | 27.20 us | 317.2 us |
| Test | 10 | PostgreSQL | False | False | 331.9 us | 14.88 us | 43.88 us | 343.4 us |
| Test | 10 | PostgreSQL | False | True | 344.7 us | 11.92 us | 35.16 us | 346.0 us |
| Test | 10 | PostgreSQL | True | False | 350.0 us | 12.44 us | 36.69 us | 356.4 us |
| Test | 10 | PostgreSQL | True | True | 357.2 us | 11.32 us | 33.37 us | 359.6 us |
| Test | 100 | PostgreSQL | False | False | 671.5 us | 34.80 us | 101.51 us | 656.9 us |
| Test | 100 | PostgreSQL | False | True | 724.5 us | 48.19 us | 142.10 us | 716.4 us |
| Test | 100 | PostgreSQL | True | False | 719.6 us | 50.41 us | 147.83 us | 695.3 us |
| Test | 100 | PostgreSQL | True | True | 730.6 us | 50.48 us | 148.86 us | 715.9 us |
| Test | 1000 | PostgreSQL | False | False | 9,738.7 us | 53.92 us | 45.02 us | 9,746.8 us |
| Test | 1000 | PostgreSQL | False | True | 10,021.4 us | 199.92 us | 273.65 us | 9,981.0 us |
| Test | 1000 | PostgreSQL | True | False | 9,299.2 us | 182.40 us | 355.76 us | 9,247.7 us |
| Test | 1000 | PostgreSQL | True | True | 10,170.5 us | 201.69 us | 207.12 us | 10,190.8 us |
| Test | 1 | SQLServer | False | False | 541.9 us | 10.80 us | 18.34 us | 548.0 us |
| Test | 1 | SQLServer | False | True | 546.4 us | 10.88 us | 11.64 us | 546.3 us |
| Test | 1 | SQLServer | True | False | 547.8 us | 10.86 us | 25.81 us | 554.7 us |
| Test | 1 | SQLServer | True | True | 552.8 us | 10.89 us | 16.63 us | 554.0 us |
| Test | 10 | SQLServer | False | False | 582.3 us | 11.44 us | 20.63 us | 590.0 us |
| Test | 10 | SQLServer | False | True | 808.9 us | 23.28 us | 68.65 us | 825.0 us |
| Test | 10 | SQLServer | True | False | 587.7 us | 11.72 us | 28.53 us | 595.7 us |
| Test | 10 | SQLServer | True | True | 817.7 us | 26.45 us | 77.99 us | 845.9 us |
| Test | 100 | SQLServer | False | False | 1,046.6 us | 63.10 us | 186.05 us | 1,111.1 us |
| Test | 100 | SQLServer | False | True | 1,853.9 us | 76.86 us | 226.62 us | 1,878.4 us |
| Test | 100 | SQLServer | True | False | 1,088.7 us | 57.03 us | 168.16 us | 1,134.6 us |
| Test | 100 | SQLServer | True | True | 1,900.9 us | 99.60 us | 293.69 us | 1,949.8 us |
| Test | 1000 | SQLServer | False | False | 2,522.5 us | 67.90 us | 199.15 us | 2,524.7 us |
| Test | 1000 | SQLServer | False | True | 20,636.4 us | 329.94 us | 275.52 us | 20,653.9 us |
| Test | 1000 | SQLServer | True | False | 20,642.5 us | 410.18 us | 770.42 us | 20,602.6 us |
| Test | 1000 | SQLServer | True | True | 20,818.2 us | 412.19 us | 677.24 us | 20,777.1 us |
Benchmark code:

```c#
BenchmarkRunner.Run<Benchmark>();

public class Benchmark
{
    private DbConnection _connection;
    private DbCommand _command;

    private const int Rows = 100000;

    [Params(1, 10, 100, 1000)]
    public int NumValues { get; set; }

    [Params("SQLServer", "PostgreSQL")]
    public string Database { get; set; }

    [Params(true, false)]
    public bool IsFound { get; set; }

    [Params(true, false)]
    public bool IsParameterized { get; set; }

    private Guid _existingGuid;

    [GlobalSetup]
    public Task Setup() => Database switch
    {
        "SQLServer" => SetupSqlServer(),
        "PostgreSQL" => SetupPostgreSQL(),
        _ => throw new NotSupportedException("Unknown database: " + Database)
    };

    private async Task SetupSqlServer()
    {
        var connection = new SqlConnection("Server=localhost;Database=test;User=SA;Password=Abcd5678;Connect Timeout=60;ConnectRetryCount=0;Encrypt=false");
        await connection.OpenAsync();

        var command = connection.CreateCommand();
        command.CommandText = """
            DROP TABLE IF EXISTS data;
            CREATE TABLE data (id UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWSEQUENTIALID(), some_int INT);
            """;
        await command.ExecuteNonQueryAsync();

        foreach (var chunk in Enumerable.Range(0, Rows).Chunk(1000))
        {
            var builder = new StringBuilder("INSERT INTO data (some_int) VALUES ");
            for (var i = 0; i < chunk.Length; i++)
            {
                if (i > 0) builder.Append(", ");
                builder.Append($"({chunk[i]})");
            }

            command.CommandText = builder.ToString();
            await command.ExecuteNonQueryAsync();
        }

        command.CommandText = "SELECT TOP(1) id FROM data";
        _existingGuid = (Guid)(await command.ExecuteScalarAsync())!;

        var builder2 = new StringBuilder("SELECT some_int FROM data WHERE id IN (");
        command.Parameters.Clear();

        if (IsParameterized)
        {
            for (var i = 0; i < NumValues; i++)
            {
                if (i > 0) builder2.Append(", ");
                builder2.Append("@p" + i);
                command.Parameters.AddWithValue("p" + i, i == NumValues - 1 && IsFound ? _existingGuid : Guid.NewGuid());
            }
        }
        else
        {
            for (var i = 0; i < NumValues; i++)
            {
                if (i > 0) builder2.Append(", ");
                builder2
                    .Append("'")
                    .Append(i == NumValues - 1 && IsFound ? _existingGuid : Guid.NewGuid())
                    .Append("'");
            }
        }

        builder2.Append(")");
        command.CommandText = builder2.ToString();

        _connection = connection;
        _command = command;
    }

    private async Task SetupPostgreSQL()
    {
        var connection = new NpgsqlConnection("Server=localhost;Username=test;Password=test");
        await connection.OpenAsync();

        var command = connection.CreateCommand();
        command.CommandText = """
            DROP TABLE IF EXISTS data;
            CREATE TABLE data (id uuid PRIMARY KEY DEFAULT gen_random_uuid(), some_int INT);
            """;
        await command.ExecuteNonQueryAsync();

        foreach (var chunk in Enumerable.Range(0, Rows).Chunk(1000))
        {
            var builder = new StringBuilder("INSERT INTO data (some_int) VALUES ");
            for (var i = 0; i < chunk.Length; i++)
            {
                if (i > 0) builder.Append(", ");
                builder.Append($"({chunk[i]})");
            }

            command.CommandText = builder.ToString();
            await command.ExecuteNonQueryAsync();
        }

        command.CommandText = "SELECT id FROM data LIMIT 1";
        _existingGuid = (Guid)(await command.ExecuteScalarAsync())!;

        var builder2 = new StringBuilder("SELECT some_int FROM data WHERE id IN (");
        command.Parameters.Clear();

        if (IsParameterized)
        {
            for (var i = 0; i < NumValues; i++)
            {
                if (i > 0) builder2.Append(", ");
                builder2.Append("$" + (i + 1));
                command.Parameters.Add(new() { Value = i == NumValues - 1 && IsFound ? _existingGuid : Guid.NewGuid() });
            }
        }
        else
        {
            for (var i = 0; i < NumValues; i++)
            {
                if (i > 0) builder2.Append(", ");
                builder2
                    .Append("'")
                    .Append(i == NumValues - 1 && IsFound ? _existingGuid : Guid.NewGuid())
                    .Append("'");
            }
        }

        builder2.Append(")");
        command.CommandText = builder2.ToString();

        _connection = connection;
        _command = command;
    }

    [Benchmark]
    public Task Test() => _command.ExecuteScalarAsync();

    [GlobalCleanup]
    public ValueTask Cleanup() => _connection.DisposeAsync();
}
```
roji commented 1 year ago

Ran the benchmarks again with SqlClient's EnableOptimizedParameterBinding. Bottom line: constants are generally better, except when we start to use a large number of parameters (significantly over 100), in which case parameters with EnableOptimizedParameterBinding are better. This again doesn't take into account the effects of plan pollution due to varying SQLs in the constants case.

  • EnableOptimizedParameterBinding helps only a little up to 100 parameters, but has a pretty dramatic effect on the 1000-parameter case: parameters with EnableOptimizedParameterBinding are far more efficient than constants there.
  • Interestingly, even with EnableOptimizedParameterBinding off, parameterization is the same as or even slightly faster than constants in the 1000-parameter case (this is also the case on PostgreSQL).
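For reference, the opt-in is a per-command flag on Microsoft.Data.SqlClient's SqlCommand; a minimal sketch (the SQL and parameter values here are made up for illustration):

```csharp
using Microsoft.Data.SqlClient;

var command = new SqlCommand("SELECT some_int FROM data WHERE id IN (@p0, @p1)");
command.Parameters.AddWithValue("@p0", Guid.NewGuid());
command.Parameters.AddWithValue("@p1", Guid.NewGuid());

// Available in Microsoft.Data.SqlClient 3.0+: parameter names are not sent
// to the server, trimming the per-parameter overhead of the request.
command.EnableOptimizedParameterBinding = true;
```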
| Method | NumValues | Database | IsFound | Parameterization | Mean | Error | StdDev | Median |
|------- |---------- |----------- |-------- |----------------- |------------:|----------:|------------:|------------:|
| Test | 1 | PostgreSQL | False | Off | 300.6 us | 9.28 us | 27.35 us | 306.6 us |
| Test | 1 | PostgreSQL | False | On | 313.2 us | 8.96 us | 26.43 us | 316.3 us |
| Test | 1 | PostgreSQL | False | Optimized | NA | NA | NA | NA |
| Test | 1 | PostgreSQL | True | Off | 316.7 us | 7.57 us | 22.33 us | 319.2 us |
| Test | 1 | PostgreSQL | True | On | 312.5 us | 8.93 us | 26.34 us | 316.9 us |
| Test | 1 | PostgreSQL | True | Optimized | NA | NA | NA | NA |
| Test | 10 | PostgreSQL | False | Off | 339.6 us | 10.81 us | 31.88 us | 342.7 us |
| Test | 10 | PostgreSQL | False | On | 349.0 us | 10.66 us | 31.44 us | 351.1 us |
| Test | 10 | PostgreSQL | False | Optimized | NA | NA | NA | NA |
| Test | 10 | PostgreSQL | True | Off | 366.0 us | 11.15 us | 32.86 us | 369.2 us |
| Test | 10 | PostgreSQL | True | On | 352.5 us | 11.04 us | 32.36 us | 355.0 us |
| Test | 10 | PostgreSQL | True | Optimized | NA | NA | NA | NA |
| Test | 100 | PostgreSQL | False | Off | 676.2 us | 33.29 us | 93.90 us | 662.1 us |
| Test | 100 | PostgreSQL | False | On | 727.9 us | 45.89 us | 133.14 us | 726.3 us |
| Test | 100 | PostgreSQL | False | Optimized | NA | NA | NA | NA |
| Test | 100 | PostgreSQL | True | Off | 680.3 us | 41.32 us | 119.22 us | 670.4 us |
| Test | 100 | PostgreSQL | True | On | 763.6 us | 64.45 us | 187.99 us | 743.1 us |
| Test | 100 | PostgreSQL | True | Optimized | NA | NA | NA | NA |
| Test | 1000 | PostgreSQL | False | Off | 9,349.7 us | 185.61 us | 437.50 us | 9,151.1 us |
| Test | 1000 | PostgreSQL | False | On | 10,031.3 us | 193.72 us | 230.62 us | 10,006.1 us |
| Test | 1000 | PostgreSQL | False | Optimized | NA | NA | NA | NA |
| Test | 1000 | PostgreSQL | True | Off | 9,103.1 us | 173.19 us | 162.00 us | 9,027.1 us |
| Test | 1000 | PostgreSQL | True | On | 10,150.7 us | 139.09 us | 130.11 us | 10,152.1 us |
| Test | 1000 | PostgreSQL | True | Optimized | NA | NA | NA | NA |
| Test | 1 | SQLServer | False | Off | 536.4 us | 10.68 us | 24.12 us | 541.9 us |
| Test | 1 | SQLServer | False | On | 537.1 us | 10.72 us | 16.69 us | 539.9 us |
| Test | 1 | SQLServer | False | Optimized | 536.1 us | 10.63 us | 18.62 us | 538.7 us |
| Test | 1 | SQLServer | True | Off | 547.3 us | 10.92 us | 22.06 us | 553.6 us |
| Test | 1 | SQLServer | True | On | 538.3 us | 12.70 us | 37.43 us | 550.2 us |
| Test | 1 | SQLServer | True | Optimized | 541.1 us | 10.70 us | 21.62 us | 547.8 us |
| Test | 10 | SQLServer | False | Off | 575.5 us | 11.47 us | 23.16 us | 581.6 us |
| Test | 10 | SQLServer | False | On | 791.3 us | 26.33 us | 77.64 us | 802.3 us |
| Test | 10 | SQLServer | False | Optimized | 793.3 us | 23.82 us | 70.23 us | 809.9 us |
| Test | 10 | SQLServer | True | Off | 591.2 us | 11.76 us | 16.09 us | 593.6 us |
| Test | 10 | SQLServer | True | On | 800.0 us | 25.88 us | 76.32 us | 819.6 us |
| Test | 10 | SQLServer | True | Optimized | 808.4 us | 19.52 us | 57.57 us | 830.2 us |
| Test | 100 | SQLServer | False | Off | 1,085.1 us | 54.93 us | 161.97 us | 1,142.9 us |
| Test | 100 | SQLServer | False | On | 1,822.4 us | 89.02 us | 262.48 us | 1,881.3 us |
| Test | 100 | SQLServer | False | Optimized | 1,730.4 us | 82.35 us | 242.82 us | 1,812.0 us |
| Test | 100 | SQLServer | True | Off | 1,091.9 us | 51.75 us | 152.59 us | 1,140.6 us |
| Test | 100 | SQLServer | True | On | 1,788.3 us | 97.34 us | 287.02 us | 1,849.6 us |
| Test | 100 | SQLServer | True | Optimized | 1,721.2 us | 82.07 us | 242.00 us | 1,773.7 us |
| Test | 1000 | SQLServer | False | Off | 20,743.4 us | 412.20 us | 1,142.20 us | 20,454.1 us |
| Test | 1000 | SQLServer | False | On | 20,719.6 us | 411.33 us | 698.47 us | 20,589.3 us |
| Test | 1000 | SQLServer | False | Optimized | 5,667.7 us | 215.82 us | 632.95 us | 5,720.9 us |
| Test | 1000 | SQLServer | True | Off | 20,166.3 us | 399.08 us | 719.63 us | 19,988.5 us |
| Test | 1000 | SQLServer | True | On | 21,198.7 us | 360.22 us | 336.95 us | 21,100.2 us |
| Test | 1000 | SQLServer | True | Optimized | 5,889.5 us | 198.02 us | 583.86 us | 5,989.1 us |
Benchmark code:

```c#
BenchmarkRunner.Run<Benchmark>();

public class Benchmark
{
    private DbConnection _connection;
    private DbCommand _command;

    private const int Rows = 100000;

    [Params(1, 10, 100, 1000)]
    public int NumValues { get; set; }

    [Params("SQLServer", "PostgreSQL")]
    public string Database { get; set; }

    [Params(true, false)]
    public bool IsFound { get; set; }

    [Params(Parameterization.Off, Parameterization.On, Parameterization.Optimized)]
    public Parameterization Parameterization { get; set; }

    private Guid _existingGuid;

    [GlobalSetup]
    public Task Setup() => Database switch
    {
        "SQLServer" => SetupSqlServer(),
        "PostgreSQL" => SetupPostgreSQL(),
        _ => throw new NotSupportedException("Unknown database: " + Database)
    };

    private async Task SetupSqlServer()
    {
        var connection = new SqlConnection("Server=localhost;Database=test;User=SA;Password=Abcd5678;Connect Timeout=60;ConnectRetryCount=0;Encrypt=false");
        await connection.OpenAsync();

        var command = connection.CreateCommand();
        command.CommandText = """
            DROP TABLE IF EXISTS data;
            CREATE TABLE data (id UNIQUEIDENTIFIER PRIMARY KEY DEFAULT NEWSEQUENTIALID(), some_int INT);
            """;
        await command.ExecuteNonQueryAsync();

        foreach (var chunk in Enumerable.Range(0, Rows).Chunk(1000))
        {
            var builder = new StringBuilder("INSERT INTO data (some_int) VALUES ");
            for (var i = 0; i < chunk.Length; i++)
            {
                if (i > 0) builder.Append(", ");
                builder.Append($"({chunk[i]})");
            }

            command.CommandText = builder.ToString();
            await command.ExecuteNonQueryAsync();
        }

        command.CommandText = "SELECT TOP(1) id FROM data";
        _existingGuid = (Guid)(await command.ExecuteScalarAsync())!;

        var builder2 = new StringBuilder("SELECT some_int FROM data WHERE id IN (");
        command.Parameters.Clear();

        if (Parameterization == Parameterization.Off)
        {
            for (var i = 0; i < NumValues; i++)
            {
                if (i > 0) builder2.Append(", ");
                builder2
                    .Append("'")
                    .Append(i == NumValues - 1 && IsFound ? _existingGuid : Guid.NewGuid())
                    .Append("'");
            }
        }
        else
        {
            if (Parameterization == Parameterization.Optimized)
            {
                command.EnableOptimizedParameterBinding = true;
            }

            for (var i = 0; i < NumValues; i++)
            {
                if (i > 0) builder2.Append(", ");
                builder2.Append("@p" + i);
                command.Parameters.AddWithValue("p" + i, i == NumValues - 1 && IsFound ? _existingGuid : Guid.NewGuid());
            }
        }

        builder2.Append(")");
        command.CommandText = builder2.ToString();

        _connection = connection;
        _command = command;
    }

    private async Task SetupPostgreSQL()
    {
        if (Parameterization == Parameterization.Optimized)
            throw new NotSupportedException();

        var connection = new NpgsqlConnection("Server=localhost;Username=test;Password=test");
        await connection.OpenAsync();

        var command = connection.CreateCommand();
        command.CommandText = """
            DROP TABLE IF EXISTS data;
            CREATE TABLE data (id uuid PRIMARY KEY DEFAULT gen_random_uuid(), some_int INT);
            """;
        await command.ExecuteNonQueryAsync();

        foreach (var chunk in Enumerable.Range(0, Rows).Chunk(1000))
        {
            var builder = new StringBuilder("INSERT INTO data (some_int) VALUES ");
            for (var i = 0; i < chunk.Length; i++)
            {
                if (i > 0) builder.Append(", ");
                builder.Append($"({chunk[i]})");
            }

            command.CommandText = builder.ToString();
            await command.ExecuteNonQueryAsync();
        }

        command.CommandText = "SELECT id FROM data LIMIT 1";
        _existingGuid = (Guid)(await command.ExecuteScalarAsync())!;

        var builder2 = new StringBuilder("SELECT some_int FROM data WHERE id IN (");
        command.Parameters.Clear();

        if (Parameterization == Parameterization.Off)
        {
            for (var i = 0; i < NumValues; i++)
            {
                if (i > 0) builder2.Append(", ");
                builder2
                    .Append("'")
                    .Append(i == NumValues - 1 && IsFound ? _existingGuid : Guid.NewGuid())
                    .Append("'");
            }
        }
        else
        {
            for (var i = 0; i < NumValues; i++)
            {
                if (i > 0) builder2.Append(", ");
                builder2.Append("$" + (i + 1));
                command.Parameters.Add(new() { Value = i == NumValues - 1 && IsFound ? _existingGuid : Guid.NewGuid() });
            }
        }

        builder2.Append(")");
        command.CommandText = builder2.ToString();

        _connection = connection;
        _command = command;
    }

    [Benchmark]
    public Task Test() => _command.ExecuteScalarAsync();

    [GlobalCleanup]
    public ValueTask Cleanup() => _connection.DisposeAsync();
}

public enum Parameterization { Off, On, Optimized }
```
stevendarby commented 1 year ago

Out of interest, has plan pollution been confirmed when there is a fixed number of constants? SQL Server can sometimes parameterise constants when the database "Parameterization" setting is set to "Simple", which I believe is the default; there is also a more extreme "Forced" option. I don't know much about this setting or whether it works for IN(...) constants, and obviously it won't help when there's a variable number of constants, but these parameterization options might be something to get a good grasp of before doing further optimisations. (They could also be a workaround for some in the meantime.)

ErikEJ commented 1 year ago

@stevendarby It should be easy enough to check: https://www.sqlshack.com/understanding-sql-server-query-plan-cache/

stevendarby commented 1 year ago

Thanks @ErikEJ. I did a quick test and SELECT * FROM Table WHERE Id IN (1, 2, 3) was not parameterized with the default "Simple" setting. However it was parameterized with the "Forced" setting.

The forced setting likely comes with warnings and caveats, so it's probably not something to recommend generally, but perhaps something to explore in more depth if anyone is looking for a workaround.
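For completeness, the database-level knob being discussed is set like this (`MyDb` is a placeholder; note the setting affects every query in the database, so test carefully before relying on it as a workaround):

```sql
-- Check the current setting (0 = SIMPLE, 1 = FORCED)
SELECT name, is_parameterization_forced FROM sys.databases WHERE name = N'MyDb';

-- Opt the whole database into forced parameterization
ALTER DATABASE MyDb SET PARAMETERIZATION FORCED;

-- Revert to the default
ALTER DATABASE MyDb SET PARAMETERIZATION SIMPLE;
```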

roji commented 1 year ago

@stevendarby it's an interesting direction, but at least in the general case there's also a varying number of values in the IN clause, so it's limited in any case. That's why an INNER JOIN-based solution could be a better way forward here.

stevendarby commented 1 year ago

@roji I said as much in my posts so don't think I've oversold it as an option for some to explore- definitely some work to do for the general case :)

stevendarby commented 1 year ago

Here's another approach for solving this problem (link):

Credit where credit is due on the OPENJSON suggestion :) https://github.com/dotnet/efcore/issues/13617#issuecomment-491399486

Just a few thoughts on this: it'd be interesting to compare the performance of this to STRING_SPLIT under a variety of scenarios. They are both table-valued functions that turn a string into a table. Even if STRING_SPLIT is faster (which seems plausible, as its parsing may be simpler), OPENJSON is probably more robust for string data, where choosing a delimiter becomes problematic.

A feature of both seems to be that SQL Server estimates 50 rows in the resulting table. It could be worthwhile measuring the impact this has under a variety of scenarios (e.g. where there are actually significantly more rows) and whether anything can be done to influence the estimate, as EF Core would know the actual number before sending the query.
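For anyone benchmarking this themselves, the two query shapes under comparison look roughly like this (table and column names are illustrative):

```sql
DECLARE @idsCsv  NVARCHAR(MAX) = N'1,2,3,4';
DECLARE @idsJson NVARCHAR(MAX) = N'[1,2,3,4]';

-- STRING_SPLIT (SQL Server 2016+): delimited string in, single-column table out
SELECT p.*
FROM Products AS p
INNER JOIN STRING_SPLIT(@idsCsv, N',') AS ids ON p.ProductTypeId = ids.value;

-- OPENJSON: JSON array in; no delimiter choice to worry about for string data
SELECT p.*
FROM Products AS p
INNER JOIN OPENJSON(@idsJson) AS ids ON p.ProductTypeId = ids.value;
```

In both cases the optimizer sees only one string parameter, which is what keeps the plan cache stable.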

ErikEJ commented 1 year ago

@stevendarby Looks like @aaronbertrand did quite a lot of testing in this area already, as always "it depends" but json is not all bad: https://sqlperformance.com/2016/04/sql-server-2016/string-split-follow-up-2

roji commented 1 year ago

Credit where credit is due on the OPENJSON suggestion :) https://github.com/dotnet/efcore/issues/13617#issuecomment-491399486

Indeed!

[...] whether anything can be done to influence the estimate, as EF Core would know the actual number before sending the query.

Sure... Though we'd probably want to avoid anything that involves embedding a (constant) query hint or similar in the SQL, as that would bring us back to the original problem (SQL which depends on the collection size).

Thanks @ErikEJ for the link! Yeah, ideally JSON would perform at least nearly as well, allowing us to just always use that and not worry about delimiters.

roji commented 1 year ago

Note: #30426 tracks generally translating operations involving array parameters via OPENJSON, not just Contains as in this issue. We should probably do both these issues together, to make sure the implementation is general and not too Contains-specific.

One possible problem: the OPENJSON approach works really nicely for ints and strings, but I'm not sure what happens with types that don't exist in JSON; for example, we have to be careful around arrays of timestamps, GUIDs etc. - our client-generated JSON string must contain the exact JSON representation that OPENJSON will properly unpack on the server.
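For example, GUIDs have no native JSON type, so the client has to serialize them as strings in exactly the format SQL Server can convert back to uniqueidentifier. A quick sketch of what that looks like server-side (reusing the data table from the benchmarks above):

```sql
DECLARE @ids NVARCHAR(MAX) = N'["0f8fad5b-d9cb-469f-a165-70867728950e"]';

-- ids.value comes back as NVARCHAR; the join relies on conversion to
-- uniqueidentifier, which only succeeds if the client emitted the
-- canonical GUID text format.
SELECT d.some_int
FROM data AS d
INNER JOIN OPENJSON(@ids) AS ids ON d.id = ids.value;
```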

roji commented 1 year ago

Note: OPENJSON/string join approaches rely on the ability to transform parameter values on the client-side before sending them (e.g. the array parameter is transformed into a JSON representation string of itself). General client-side parameter transformations are tracked by #28028.

The alternative would be to add a full type mapping for array types, value-converting them to their JSON representation. That has a much wider scope of change, and would also allow mapping primitive array properties to JSON columns (#29427).
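As a rough illustration of the value-converter direction (a sketch only, not EF's actual design; the converter simply round-trips the array through its JSON text):

```csharp
using System.Text.Json;
using Microsoft.EntityFrameworkCore.Storage.ValueConversion;

// Maps an int[] to the NVARCHAR JSON text that OPENJSON can unpack server-side.
// The explicit (JsonSerializerOptions?)null is needed because optional
// arguments aren't allowed inside expression trees.
var intArrayToJson = new ValueConverter<int[], string>(
    v => JsonSerializer.Serialize(v, (JsonSerializerOptions?)null),
    v => JsonSerializer.Deserialize<int[]>(v, (JsonSerializerOptions?)null)!);
```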

Giorgi commented 1 year ago

I wonder what effect it would have to rewrite (if it's possible) a Contains query into one that first gets an IQueryable<int> (or another type, depending on the filter column) and calls IQueryable<int>.Contains with it. Would this result in better performance and cache-friendly queries?

ErikEJ commented 1 year ago

@Giorgi the important factors are if the SQL query is parameterized and has a limited number of variations in terms of number of parameters.

Giorgi commented 1 year ago

@ErikEJ If you use IQueryable<int>.Contains there is no need to parameterize the IN() clause anymore, because EF will generate something like this: ... Where Id In (first query to get Ids)

yv989c commented 1 year ago

👋 @Giorgi, if you are looking for an immediate solution to this problem you may find https://github.com/yv989c/BlazarTech.QueryableValues useful.

roji commented 1 year ago

@Giorgi what's that Ids part, how does the array of e.g. ints actually get sent to SQL Server?

Giorgi commented 1 year ago

@roji This is what I mean. Very often I have seen code like this:

var postIds  = ctx.Posts.Where(p => p.Date > somedate).Select(p=>p.Id).ToList();

//This will generate a non-parameterized IN clause, as described in this issue
var comments = ctx.Comments.Where(c => postIds.Contains(c.PostId));

This can be rewritten as

var postIds  = ctx.Posts.Where(p => p.Date > somedate).Select(p=>p.Id);

var comments = ctx.Comments.Where(c => postIds.Contains(c.PostId));

In this case the generated SQL will roughly look like this:

Select * from comments c where c.PostId in (Select postId from Posts where date > somedate)

This query will not suffer from the problem described in this issue because there are no hard-coded values. So my question was whether this will also have the performance boost described in the other workarounds and generate cache-friendly plans.

roji commented 1 year ago

@Giorgi sure, you're right - though that's the "easy" case. The problem is when you really do have an array of ints that came from somewhere, not another query that you can just inline into your query as a subquery.