Closed: pregress closed this issue 1 week ago
Thank you for sharing the details of the performance of this query. We rewrote the token cleanup for v7 to fix race conditions and to make it more performant.
But we do have some limitations:
But this requires a change to the public API of `IOperationalStoreNotification`, since it works with an `IEnumerable<PersistedGrant>`.
We are very careful when introducing breaking changes. We considered the fix you propose, which avoids loading the full entities, but decided that it didn't justify a breaking change. The fix we considered instead was to check whether there are any subscribers to the notifications and, if not, simply not load the entities at all.
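That considered fix could be sketched roughly like this (my illustration, not actual Duende code; `_context`, `_options` and `_operationalStoreNotification` stand in for the cleanup service's injected dependencies):

```csharp
// Sketch only: skip materializing entities when nobody listens for notifications.
protected virtual async Task RemoveExpiredGrantsSketchAsync(CancellationToken ct)
{
    var expired = _context.PersistedGrants
        .Where(x => x.Expiration < DateTime.UtcNow);

    if (_operationalStoreNotification is null)
    {
        // No subscribers: delete in the database without loading entities at all.
        await expired.Take(_options.TokenCleanupBatchSize).ExecuteDeleteAsync(ct);
    }
    else
    {
        // Subscribers exist: the full entities must be loaded so they can be
        // handed to IOperationalStoreNotification.PersistedGrantsRemovedAsync.
        var grants = await expired.Take(_options.TokenCleanupBatchSize).ToArrayAsync(ct);
        _context.PersistedGrants.RemoveRange(grants);
        await _context.SaveChangesAsync();
        await _operationalStoreNotification.PersistedGrantsRemovedAsync(grants);
    }
}
```

This keeps `IOperationalStoreNotification` unchanged, at the cost of two code paths in the cleanup service.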
It is very valuable for us to get the actual performance numbers and the query plan details. The conclusions I draw from this are:
Do you have any actual durations for this, i.e. what is the query duration in milliseconds?
Any optimization of this query would also have to take the next query into account. The first query brings all the relevant database pages into RAM and the CPU cache. The next query deletes the data, and that operation needs access to the same database pages, so it benefits from them already being in RAM/cache.
Regarding the OpenJSON issue: that is really not an IdentityServer issue, it is an incompatibility between EF Core and SQL Server. If it were a major performance issue we would still have to handle it, but our estimate so far has been that, the way our queries are written, the OpenJSON row-estimate issue should not be a problem.
If you have a query plan for the deletion query that would be valuable to see to understand how OpenJSON is handled.
I agree that the query uses the index and that this is efficient, but the key lookup makes the query plan suboptimal.
Here are the performance metrics for:
Delete expired with subquery
```sql
(@__First_Expiration_0 datetime2(7),@__Last_Expiration_1 datetime2(7),@__foundIds_2 nvarchar(4000))DELETE FROM [p]
FROM [PersistedGrants] AS [p]
WHERE [p].[Id] IN (
    SELECT [p0].[Id]
    FROM [PersistedGrants] AS [p0]
    WHERE [p0].[Expiration] < GETUTCDATE() AND [p0].[Expiration] >= @__First_Expiration_0 AND [p0].[Expiration] <= @__Last_Expiration_1 AND [p0].[Id] IN (
        SELECT [f].[value]
        FROM OPENJSON(@__foundIds_2) WITH ([value] bigint '$') AS [f]
    )
)
```
INTERVAL | CPU[%] | DATA IO[%] | LOG IO[%] | DURATION[hh:mm:ss] | EXECUTIONS COUNT |
---|---|---|---|---|---|
5/26 - 5/27 | 0.06 | 0.61 | 0.17 | 00:09:40.120 | 444 |
5/27 - 5/28 | 0.06 | 0.55 | 0.16 | 00:13:57.300 | 403 |
5/28 - 5/29 | 0.07 | 0.7 | 0.16 | 00:14:22.730 | 420 |
5/29 - 5/30 | 0.08 | 0.73 | 0.18 | 00:14:30.200 | 455 |
5/30 - 5/31 | 0.09 | 0.98 | 0.19 | 00:19:15.630 | 474 |
5/31 - 6/1 | 0.09 | 1.12 | 0.19 | 00:43:22.420 | 460 |
6/1 - 6/2 | 0.08 | 0.83 | 0.15 | 00:21:14.860 | 374 |
6/2 - 6/3 | 0.06 | 0.82 | 0.08 | 00:51:13.220 | 203 |
6/3 - 6/4 | 0.11 | 1.14 | 0.27 | 00:29:57.910 | 692 |
Here is the execution plan for the actual delete query: https://www.brentozar.com/pastetheplan/?id=B1SS51dL0 It also contains a table spool, although at only a 6% cost. I don't think there is anything wrong with the actual delete; writing is expensive on slow Azure storage. https://erikdarling.com/optimizing-sql-server-query-plans-with-table-spools-in-them/
Select expired records (with expensive key lookup)

INTERVAL | CPU[%] | DATA IO[%] | LOG IO[%] | DURATION[hh:mm:ss] | EXECUTIONS COUNT |
---|---|---|---|---|---|
5/26 - 5/27 | 0.01 | 0.26 | 0 | 00:02:48.630 | 444 |
5/27 - 5/28 | 0.01 | 0.22 | 0 | 00:02:59.690 | 404 |
5/28 - 5/29 | 0.01 | 0.14 | 0 | 00:01:59.950 | 420 |
5/29 - 5/30 | 0.02 | 0.33 | 0 | 00:03:45.790 | 456 |
5/30 - 5/31 | 0.01 | 0.19 | 0 | 00:02:20.290 | 475 |
5/31 - 6/1 | 0.01 | 0.23 | 0 | 00:04:29.770 | 461 |
6/1 - 6/2 | 0.01 | 0.23 | 0 | 00:02:54.680 | 376 |
6/2 - 6/3 | 0 | 0.05 | 0 | 00:01:36.050 | 205 |
6/3 - 6/4 | 0.02 | 0.48 | 0 | 00:06:05.070 | 692 |
6/4 - 6/5 | 0.02 | 0.34 | 0 | 00:03:40.900 | 400 |
Query plan: https://www.brentozar.com/pastetheplan/?id=SyUSxtbI0
We are running on Azure SQL, so every unnecessary page retrieval (e.g. a key lookup for data that is not used by X% of the implementations) is always bad for performance. We temporarily scaled up our Azure SQL tier to mitigate this issue.
Thanks for sharing the numbers. I did some calculations to understand how much of the total DB time is caused by the select. For example, on the first line the delete is 580 seconds and the select is 168 seconds, for a total of 748 seconds; the select makes up 22% of the total DB time that day.
If I get this right, summed across all the days the select only makes up about 11% of the total DB time. I don't think it's worth spending more time optimizing this on the EF abstraction layer.
For high volume/performance deployments we do recommend tuning the database with the features available on the DB engine used and possibly replacing/reworking the queries to be optimal with the specific database.
If you are using SQL Server I think that you could possibly get performance gains by changing the indexes:
This would make the delete operations efficient, as they would be a range scan at the beginning of the clustered index. The read-by-key operations could be completed from the index alone, without any lookup into the clustered index. The drawbacks would be increased storage requirements and probably slower writes/inserts. As with any DB tuning, you would have to test and measure the performance in your specific environment.
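As an illustration of that description using EF Core's SQL Server provider (these index choices are my assumptions reconstructed from the paragraph above, not a tested recommendation; measure before adopting):

```csharp
// Hypothetical index tuning sketch, e.g. in an OnModelCreating override.
modelBuilder.Entity<PersistedGrant>(grant =>
{
    // The primary key must become nonclustered so another index can be clustered.
    grant.HasKey(x => x.Id).IsClustered(false);

    // Cluster on Expiration (plus Id for uniqueness) so expired rows sit at the
    // start of the clustered index and the cleanup delete becomes a range scan.
    grant.HasIndex(x => new { x.Expiration, x.Id }).IsClustered();

    // Covering nonclustered index on Key so read-by-key operations are served
    // from this index alone, with no lookup into the clustered index. Included
    // columns increase storage and slow down writes, per the trade-off above.
    grant.HasIndex(x => x.Key)
         .IsUnique()
         .IncludeProperties(x => new { x.SubjectId, x.ClientId, x.Type, x.Data });
});
```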
I noticed now that the execution plans look like estimated execution plans. Could you please share the actual execution plans? They would be needed to find out whether the OpenJSON problem causes a performance issue here.
We will implement our own TokenCleanupService that utilizes the existing indexes.
Regarding the actual execution plan: we don't have the actual plan in the query store.
For those needing the implementation:
```csharp
using Duende.IdentityServer.EntityFramework;
using Duende.IdentityServer.EntityFramework.Interfaces;
using Duende.IdentityServer.EntityFramework.Options;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.Logging;

public class FastTokenCleanUpService : TokenCleanupService
{
    private readonly OperationalStoreOptions _options;
    private readonly IPersistedGrantDbContext _persistedGrantDbContext;
    private readonly ILogger<TokenCleanupService> _logger;

    public FastTokenCleanUpService(OperationalStoreOptions options,
        IPersistedGrantDbContext persistedGrantDbContext,
        ILogger<TokenCleanupService> logger,
        IOperationalStoreNotification operationalStoreNotification = null)
        : base(options, persistedGrantDbContext, logger, operationalStoreNotification)
    {
        _options = options;
        _persistedGrantDbContext = persistedGrantDbContext;
        _logger = logger;
        if (operationalStoreNotification != null)
        {
            throw new NotSupportedException(
                "FastTokenCleanUpService should not be used in combination with operationalStoreNotification, fall back on the default identity behavior");
        }
    }

    protected override async Task RemoveExpiredPersistedGrantsAsync(CancellationToken cancellationToken = default)
    {
        var found = int.MaxValue;
        while (found >= _options.TokenCleanupBatchSize)
        {
            var query = _persistedGrantDbContext.PersistedGrants
                .Where(x => x.Expiration < DateTime.UtcNow);
            // ExecuteDeleteAsync deletes in the database without loading entities.
            found = await query.Take(_options.TokenCleanupBatchSize)
                .ExecuteDeleteAsync(cancellationToken);
            _logger.LogInformation("Removed {grantCount} expired grants", found);
        }
    }

    protected override async Task RemoveConsumedPersistedGrantsAsync(CancellationToken cancellationToken = default)
    {
        var found = int.MaxValue;
        var delay = TimeSpan.FromSeconds(_options.ConsumedTokenCleanupDelay);
        var consumedTimeThreshold = DateTime.UtcNow.Subtract(delay);
        while (found >= _options.TokenCleanupBatchSize)
        {
            var query = _persistedGrantDbContext.PersistedGrants
                .Where(x => x.ConsumedTime < consumedTimeThreshold);
            found = await query.Take(_options.TokenCleanupBatchSize)
                .ExecuteDeleteAsync(cancellationToken);
            _logger.LogInformation("Removed {grantCount} consumed grants", found);
        }
    }

    protected override async Task RemoveDeviceCodesAsync(CancellationToken cancellationToken = default)
    {
        var found = int.MaxValue;
        while (found >= _options.TokenCleanupBatchSize)
        {
            var query = _persistedGrantDbContext.DeviceFlowCodes
                .Where(x => x.Expiration < DateTime.UtcNow);
            found = await query.Take(_options.TokenCleanupBatchSize)
                .ExecuteDeleteAsync(cancellationToken);
            _logger.LogInformation("Removed {deviceCodeCount} device flow codes", found);
        }
    }
}
```
```csharp
// program.cs OR startup.cs
// after AddOperationalStore
var serviceDescriptor = builder.Services.Single(descriptor => descriptor.ServiceType == typeof(ITokenCleanupService));
builder.Services.Remove(serviceDescriptor);
builder.Services.AddTransient<ITokenCleanupService, FastTokenCleanUpService>();
```
Which version of Duende IdentityServer are you using? 7.0.5
Which version of .NET are you using? .NET 8
Describe the bug
https://github.com/DuendeSoftware/IdentityServer/blob/e9860c6488f90e8fbc11a4452b9dd111dbfae933/src/EntityFramework.Storage/TokenCleanup/TokenCleanupService.cs#L92
Fetches the whole `PersistedGrant` object for cleanup, which causes very expensive key lookups. See execution plan: https://www.brentozar.com/pastetheplan/?id=SyUSxtbI0 This could be fixed by using the existing index `IX_PersistedGrants_Expiration` correctly, if only `Id` and `Expiration` are returned. But this requires a change to the public API of `IOperationalStoreNotification`, since it works with an `IEnumerable<PersistedGrant>`.
The fix would be to return only `Id` and `Expiration` from the batch query.
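A sketch of that idea (my illustration, reusing the field names from the `FastTokenCleanUpService` workaround above): select only the `Id` values of expired grants, so the query is covered by `IX_PersistedGrants_Expiration` (the clustered key `Id` is implicitly part of every nonclustered index), then delete by those ids.

```csharp
// Hypothetical sketch: the select is satisfied from IX_PersistedGrants_Expiration
// alone, so no key lookup into the clustered index is needed.
var expiredIds = await _persistedGrantDbContext.PersistedGrants
    .Where(x => x.Expiration < DateTime.UtcNow)
    .OrderBy(x => x.Expiration)
    .Select(x => x.Id)
    .Take(_options.TokenCleanupBatchSize)
    .ToArrayAsync(cancellationToken);

// Note: Contains over a parameter still translates to OPENJSON on SQL Server,
// mirroring the delete-with-subquery pattern shown earlier in this thread.
var removed = await _persistedGrantDbContext.PersistedGrants
    .Where(x => expiredIds.Contains(x.Id))
    .ExecuteDeleteAsync(cancellationToken);
```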
To Reproduce
Expected behavior
Cleanup should not produce an execution plan with a key lookup.
Log output/exception with stacktrace: N/A
Additional context: Adding include columns to the existing index would also solve this, but then it's just a copy of the whole table.
The query plan also contains the OPENJSON pattern (https://github.com/DuendeSoftware/IdentityServer/issues/1564), which also skews the estimated number of rows, while the estimate should always equal the batch size. In our case it's not a big deal, but I could imagine this going wrong elsewhere.