DuendeSoftware / Support

Support for Duende Software products

TokenCleanupService creates a suboptimal query plan with an expensive key lookup #1304

Closed pregress closed 1 week ago

pregress commented 3 weeks ago

Which version of Duende IdentityServer are you using? 7.0.5

Which version of .NET are you using? .NET 8

Describe the bug

https://github.com/DuendeSoftware/IdentityServer/blob/e9860c6488f90e8fbc11a4452b9dd111dbfae933/src/EntityFramework.Storage/TokenCleanup/TokenCleanupService.cs#L92

The cleanup fetches the whole PersistedGrant object, which causes very expensive key lookups. See the execution plan: https://www.brentozar.com/pastetheplan/?id=SyUSxtbI0. This could be fixed by using the existing index IX_PersistedGrants_Expiration correctly, returning only Id and Expiration. However, that requires a change to the public API of IOperationalStoreNotification, since it works with an IEnumerable<PersistedGrant>.

The fix would be:

var query = _persistedGrantDbContext.PersistedGrants
                .Where(x => x.Expiration < DateTime.UtcNow)
                .OrderBy(x => x.Expiration)
                .Select(x => new { x.Id, x.Expiration });
// ..
if (_operationalStoreNotification != null)
{
    await _operationalStoreNotification.PersistedGrantsRemovedAsync(expiredGrants); // <= would need a different type, or a separate implementation when _operationalStoreNotification is null.
}

To Reproduce

  1. Have a lot of PersistedGrants
  2. Run the token cleanup
  3. Check query store.

Expected behavior

Cleanup should not produce an execution plan with a key lookup.

Log output/exception with stacktrace: N/A

Additional context: Adding include columns to the existing index would also solve this, but then it's just a copy of the whole table.

The query plan also contains the OPENJSON issue (https://github.com/DuendeSoftware/IdentityServer/issues/1564), which skews the estimated number of rows, even though the estimate should always equal the batch size. In our case it's not a big deal, but I can imagine this going wrong elsewhere.

AndersAbel commented 3 weeks ago

Thank you for sharing the details of the performance of this query. We rewrote the token cleanup for v7 with the intention to fix race conditions and also make it more performant.

But we do have some limitations:

"However, that requires a change to the public API of IOperationalStoreNotification, since it works with an IEnumerable<PersistedGrant>."

We are very careful about introducing breaking changes. We considered the fix you propose so that the full entities don't have to be loaded, but decided it didn't justify a breaking change. The alternative we did consider was to check whether there are any subscribers to the notifications and, if not, simply not load the entities at all (see the sketch below).
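
As a rough illustration only (not actual IdentityServer code), the idea could look something like this inside the expired-grants cleanup, reusing the field names that TokenCleanupService already has:

// Hypothetical sketch: skip materializing PersistedGrant entities when nothing
// subscribes to IOperationalStoreNotification, so IX_PersistedGrants_Expiration
// can cover the query without a key lookup.
var baseQuery = _persistedGrantDbContext.PersistedGrants
    .Where(x => x.Expiration < DateTime.UtcNow)
    .OrderBy(x => x.Expiration)
    .Take(_options.TokenCleanupBatchSize);

List<PersistedGrant> expiredGrants = null;
List<long> expiredIds;

if (_operationalStoreNotification != null)
{
    // Someone listens: load the full entities so they can still be passed to
    // PersistedGrantsRemovedAsync after the delete.
    expiredGrants = await baseQuery.ToListAsync(cancellationToken);
    expiredIds = expiredGrants.Select(x => x.Id).ToList();
}
else
{
    // Nobody listens: fetch only the Ids needed to drive the delete.
    expiredIds = await baseQuery.Select(x => x.Id).ToListAsync(cancellationToken);
}

The subsequent delete could then be driven by expiredIds in both branches, so only the notification path pays for the key lookups.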

It is very valuable for us to get the actual performance numbers and the query plan details. The conclusions I draw from them are:

  1. The query utilizes the index as expected, which is a very efficient operation.
  2. A clustered index key lookup is usually a fast operation, so even though it accounts for 99% of the cost in this query, that may simply indicate that the index scan is even faster.

Do you have any actual durations for this, i.e. what is the query duration in milliseconds?

Any optimizations to this query would also have to take the next query into account. This query brings all the relevant database pages into RAM and the CPU cache. The next query deletes the data, and that operation needs access to the same pages, so it benefits from them already being in memory.

Regarding the OpenJSON issue: that is really not an IdentityServer issue; it is an incompatibility between EF Core and SQL Server. If it turned out to be a major performance issue we would still have to handle it, but our estimate so far has been that, given the way our queries are written, the OpenJSON row estimate problem should not be an issue.

If you have a query plan for the deletion query that would be valuable to see to understand how OpenJSON is handled.

pregress commented 2 weeks ago

I agree that the query uses the index and that this is efficient, but the key lookup makes the query plan suboptimal.

Here are the performance metrics for:

Delete expired with subquery

(@__First_Expiration_0 datetime2(7), @__Last_Expiration_1 datetime2(7), @__foundIds_2 nvarchar(4000))
DELETE FROM [p]
FROM [PersistedGrants] AS [p]
WHERE [p].[Id] IN (
    SELECT [p0].[Id]
    FROM [PersistedGrants] AS [p0]
    WHERE [p0].[Expiration] < GETUTCDATE() AND [p0].[Expiration] >= @__First_Expiration_0 AND [p0].[Expiration] <= @__Last_Expiration_1 AND [p0].[Id] IN (
        SELECT [f].[value]
        FROM OPENJSON(@__foundIds_2) WITH ([value] bigint '$') AS [f]
    )
)
Interval       CPU [%]   Data IO [%]   Log IO [%]   Duration [hh:mm:ss]   Execution count
5/26 - 5/27    0.06      0.61          0.17         00:09:40.120          444
5/27 - 5/28    0.06      0.55          0.16         00:13:57.300          403
5/28 - 5/29    0.07      0.7           0.16         00:14:22.730          420
5/29 - 5/30    0.08      0.73          0.18         00:14:30.200          455
5/30 - 5/31    0.09      0.98          0.19         00:19:15.630          474
5/31 - 6/1     0.09      1.12          0.19         00:43:22.420          460
6/1 - 6/2      0.08      0.83          0.15         00:21:14.860          374
6/2 - 6/3      0.06      0.82          0.08         00:51:13.220          203
6/3 - 6/4      0.11      1.14          0.27         00:29:57.910          692

Here is the execution plan for the actual delete query: https://www.brentozar.com/pastetheplan/?id=B1SS51dL0. It also contains a table spool, although at only a 6% cost; I don't think there is anything wrong with the actual delete itself, since writing is expensive on slow Azure storage. See https://erikdarling.com/optimizing-sql-server-query-plans-with-table-spools-in-them/ for background on table spools.

Select expired records (with expensive key lookup)

Interval       CPU [%]   Data IO [%]   Log IO [%]   Duration [hh:mm:ss]   Execution count
5/26 - 5/27    0.01      0.26          0            00:02:48.630          444
5/27 - 5/28    0.01      0.22          0            00:02:59.690          404
5/28 - 5/29    0.01      0.14          0            00:01:59.950          420
5/29 - 5/30    0.02      0.33          0            00:03:45.790          456
5/30 - 5/31    0.01      0.19          0            00:02:20.290          475
5/31 - 6/1     0.01      0.23          0            00:04:29.770          461
6/1 - 6/2      0.01      0.23          0            00:02:54.680          376
6/2 - 6/3      0         0.05          0            00:01:36.050          205
6/3 - 6/4      0.02      0.48          0            00:06:05.070          692
6/4 - 6/5      0.02      0.34          0            00:03:40.900          400

Query plan: https://www.brentozar.com/pastetheplan/?id=SyUSxtbI0

We are running on Azure SQL, so every unnecessary page retrieval (e.g. a key lookup for data that is not used by X% of the implementations) is always bad for performance. We temporarily scaled up our Azure SQL instance to mitigate this issue.

AndersAbel commented 1 week ago

Thanks for sharing the numbers. I did some calculations to understand how much of the total DB time is caused by the select. For example, on the first line the delete is 580 seconds and the select is 168 seconds, so the total operation time is 748 seconds and the select makes up 22% of it.

Taken over all of the days combined, the select only makes up around 11% of the total DB time. I don't think it's worth spending more time optimizing this on the EF abstraction layer.

For high volume/performance deployments we do recommend tuning the database with the features available on the DB engine used and possibly replacing/reworking the queries to be optimal with the specific database.

If you are using SQL Server I think that you could possibly get performance gains by changing the indexes:

  1. Change the clustered index to be the expiry time.
  2. Create a new unique index on the key, which includes all columns.

This would make the delete operations efficient, as they would be a range scan at the beginning of the clustered index. The read-by-key operations could be completed from the new index alone, without having to do any lookup on the clustered index. The drawbacks would be increased storage requirements and probably slower writes/inserts. As with any DB tuning you would have to test and measure the performance in your specific environment.
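
As a rough sketch only, assuming the default table, column and index names generated by the EF Core migrations (verify against your own schema; the new index names here are made up):

-- Illustration only: test and measure on a copy of the database first.

-- Recreate the primary key on Id as non-clustered so the expiry time can become the clustered key.
ALTER TABLE dbo.PersistedGrants DROP CONSTRAINT PK_PersistedGrants;
ALTER TABLE dbo.PersistedGrants ADD CONSTRAINT PK_PersistedGrants PRIMARY KEY NONCLUSTERED (Id);

-- Cluster the table on Expiration so expired rows form a contiguous range at the start of the index.
DROP INDEX IX_PersistedGrants_Expiration ON dbo.PersistedGrants;
CREATE CLUSTERED INDEX CIX_PersistedGrants_Expiration ON dbo.PersistedGrants (Expiration);

-- Unique covering index on [Key] so read-by-key operations never have to touch the clustered index.
-- Expiration is carried automatically as the clustering key; the existing unique index on [Key]
-- would become redundant and could be dropped.
CREATE UNIQUE NONCLUSTERED INDEX IX_PersistedGrants_Key_Covering
    ON dbo.PersistedGrants ([Key])
    INCLUDE (Id, Type, SubjectId, SessionId, ClientId, Description,
             CreationTime, ConsumedTime, Data);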

AndersAbel commented 1 week ago

I noticed now that the execution plans look like estimated execution plans. Could you please share the actual execution plans? Those would be needed to find out whether the OpenJSON problem causes a performance issue here.

pregress commented 1 week ago

We will implement our own TokenCleanupService that utilizes the existing indexes.

Regarding the actual execution plan: we don't have it in Query Store.

pregress commented 1 week ago

For those needing the implementation:

public class FastTokenCleanUpService : TokenCleanupService
{
    private readonly OperationalStoreOptions _options;
    private readonly IPersistedGrantDbContext _persistedGrantDbContext;
    private readonly ILogger<TokenCleanupService> _logger;

    public FastTokenCleanUpService(OperationalStoreOptions options,
        IPersistedGrantDbContext persistedGrantDbContext,
        ILogger<TokenCleanupService> logger,
        IOperationalStoreNotification operationalStoreNotification = null) : base(options, persistedGrantDbContext, logger, operationalStoreNotification)
    {
        _options = options;
        _persistedGrantDbContext = persistedGrantDbContext;
        _logger = logger;

        if (operationalStoreNotification != null)
        {
            throw new NotSupportedException(
                "FastTokenCleanUpService should not be used in combination with operationalStoreNotification, fall back on the default identity behavior");
        }
    }

    protected override async Task RemoveExpiredPersistedGrantsAsync(CancellationToken cancellationToken = default)
    {
        var found = int.MaxValue;

        while (found >= _options.TokenCleanupBatchSize)
        {
            var query = _persistedGrantDbContext.PersistedGrants
                .Where(x => x.Expiration < DateTime.UtcNow);
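            // Relies on the EF Core SQL Server provider translating Take + ExecuteDeleteAsync
            // into a single set-based DELETE per batch; no entities are materialized, so the
            // key-lookup-heavy select from the default implementation is avoided.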

            found = await query.Take(_options.TokenCleanupBatchSize)
                    .ExecuteDeleteAsync(cancellationToken);

            _logger.LogInformation("Removed {grantCount} expired grants", found);
        }
    }

    protected override async Task RemoveConsumedPersistedGrantsAsync(CancellationToken cancellationToken = default)
    {
        var found = int.MaxValue;

        var delay = TimeSpan.FromSeconds(_options.ConsumedTokenCleanupDelay);
        var consumedTimeThreshold = DateTime.UtcNow.Subtract(delay);

        while (found >= _options.TokenCleanupBatchSize)
        {
            var query = _persistedGrantDbContext.PersistedGrants
                .Where(x => x.ConsumedTime < consumedTimeThreshold);

            found = await query.Take(_options.TokenCleanupBatchSize)
                    .ExecuteDeleteAsync(cancellationToken);

            _logger.LogInformation("Removed {grantCount} consumed grants", found);
        }
    }

    protected override async Task RemoveDeviceCodesAsync(CancellationToken cancellationToken = default)
    {
        var found = int.MaxValue;

        while (found >= _options.TokenCleanupBatchSize)
        {
            var query = _persistedGrantDbContext.DeviceFlowCodes
                .Where(x => x.Expiration < DateTime.UtcNow);

            found = await query.Take(_options.TokenCleanupBatchSize)
                    .ExecuteDeleteAsync(cancellationToken);

            _logger.LogInformation("Removed {deviceCodeCount} device flow codes", found);
        }
    }
}

// program.cs OR startup.cs
// after AddOperationalStore

    var serviceDescriptor = builder.Services.Single(descriptor => descriptor.ServiceType == typeof(ITokenCleanupService));
    builder.Services.Remove(serviceDescriptor);
    builder.Services.AddTransient<ITokenCleanupService, FastTokenCleanUpService>();