Open adam-mccoy opened 3 years ago
Customer reported experiencing this issue (internal ticket): https://octopus.zendesk.com/agent/tickets/80435
Another customer experiencing this issue: https://octopus.zendesk.com/agent/tickets/81992
Proof that SQL could have chosen a better existing index:
Specifying an index in a query is an anti-pattern so we need to figure out what's causing SQL to make this poor decision.
Another thing to optimize: the endpoint below generates 2 queries. https://capptivate.octopus.app/api/events?regarding=RunbookSnapshots-8567&spaces=Spaces-1&includeSystem=true
We should consider only grabbing the data until we can't find more data. We should not need to count. We could cut this to a single query instead of 2. In addition, COUNT
is not cheap.
List of queries from a single endpoint request:
-- this is taking 2s
exec sp_executesql N'SELECT COUNT(*)
FROM [dbo].[Event]
WHERE (((([SpaceId] in (''Spaces-1''))) OR (([SpaceId] is null))))
AND ([Id] in (SELECT er.EventId from EventRelatedDocument er where er.RelatedDocumentId = @regarding0))',N'@regarding0 nvarchar(100)',@regarding0=N'RunbookSnapshots-8567'
exec sp_executesql N'SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER (ORDER BY [Occurred] DESC, [AutoId] DESC) AS RowNum
FROM [dbo].[Event]
WHERE (((([SpaceId] in (''Spaces-1''))) OR (([SpaceId] is null))))
AND ([Id] in (SELECT er.EventId from EventRelatedDocument er where er.RelatedDocumentId = @regarding0))
) ALIAS_GENERATED_1
WHERE ([RowNum] >= @_minrow)
AND ([RowNum] <= @_maxrow)
ORDER BY [RowNum]',N'@regarding0 nvarchar(100),@_minrow int,@_maxrow int',@regarding0=N'RunbookSnapshots-8567',@_minrow=1,@_maxrow=30
Release Note: Fixed SQL timeout on the api/events
endpoint.
:tada: The fix for this issue has been released in:
Release stream | Release |
---|---|
2021.3 | 2021.3.12042 |
2022.1 | 2022.1.359 |
We are still experiencing this issue (internal): https://octopusdeploy.slack.com/archives/C01HZFJRYSH/p1643844703973029
Additional Fixes supplied for the 2022.1 version from 2022.1.1286
onwards. This was not applied to the 2021.3
range
Hi @zentron. I would like to keep this issue open for now since my last PR has a feature toggle that is not available to customers yet.
For now please rebuild the indexes as described in the workaround section.
Another affected customer according to this internal thread
It might be worth trying the Automatic plan correction.
The affected indexes listed above may be different depending on the version.
In 2023.x versions the index IX_EventRelatedDocument_RelatedDocumentIdPrefix
may also be affected and the IX_EventRelatedDocument_EventId
index no longer exists.
We are experiencing the same timeouts when loading a runbook snapshot (running Octopus version 2023.2 (Build 12998). I don't think this is an index issue. This is due to the way the select statement is written and how SQL server engine changes the execution plan when data volumes increase significantly. When executing the select statement directly it now takes over 1 minute to complete. By changing the select statement to either imbedding the select statement from the EventRelatedDocument table first (either in a temp table or table variable), the select statement runs in under 110 milliseconds (temp table runs in 50ms, table variable runs in 110ms). e.g.
exec sp_executesql N'DECLARE @EventRelatedDocument TABLE (EventId NVARCHAR(50), RelatedDocumentId NVARCHAR(250))
INSERT INTO @EventRelatedDocument SELECT EventId, RelatedDocumentId from EventRelatedDocument where RelatedDocumentId = @regarding0
SELECT * FROM ( SELECT E.Id,RelatedDocumentIds,ProjectId,EnvironmentId,Category,UserId,Username,Occurred,Message,TenantId,AutoId,DataVersion,UserAgent,SpaceId,JSONBlob,ChangeDetails,IpAddress,JSON, ROW_NUMBER() OVER (ORDER BY [Occurred] DESC, [AutoId] DESC) AS RowNum FROM [dbo].[Event] E JOIN @EventRelatedDocument ER ON E.Id = ER.EventId WHERE (((([SpaceId] in (''Spaces-1''))) OR (([SpaceId] is null)))) and E.Id = ER.EventId ) ALIAS_GENERATED_1 WHERE ([RowNum] >= @_minrow) AND ([RowNum] <= @_maxrow) ORDER BY [RowNum];',N'@regarding0 nvarchar(100),@_minrow int,@_maxrow int',@regarding0=N'RunbookSnapshots-5650',@_minrow=1,@_maxrow=30;
Further analysis and testing has provided another solution and that is to eliminate the key lookups caused by the index not having all columns in the INCLUDE. They way to do this is to join the Event table to itself and select the columns from the joined table. Tests have shown this to resolve the long duration of queries hitting runbook snapshots with large volumes of data.
exec sp_executesql N'SELECT * FROM ( SELECT E2.Id,E2.RelatedDocumentIds,E2.ProjectId,E2.EnvironmentId,E2.Category,E2.UserId,E2.Username,E2.Occurred,E2.Message, E2.TenantId,E2.AutoId,E2.DataVersion,E2.UserAgent,E2.SpaceId,E2.JSONBlob,E2.ChangeDetails,E2.IpAddress,E2.JSON, ROW_NUMBER() OVER (ORDER BY E2.[Occurred] DESC, E2.[AutoId] DESC) AS RowNum FROM [dbo].[Event] E JOIN [dbo].[Event] E2 ON E.Id = E2.Id WHERE ((((E.[SpaceId] in (''Spaces-1''))) OR ((E.[SpaceId] is null)))) AND (E.[Id] in (SELECT er.EventId from EventRelatedDocument er where er.RelatedDocumentId = @regarding0)) ) ALIAS_GENERATED_1 WHERE ([RowNum] >= @_minrow) AND ([RowNum] <= @_maxrow) ORDER BY [RowNum]',N'@regarding0 nvarchar(100),@_minrow int,@_maxrow int',@regarding0=N'RunbookSnapshots-5650',@_minrow=1,@_maxrow=30;
@mildmanneredjon when the query was taking more than 1 minute was that always occurring or was it only occurring when mssql happens to chose a bad plan.
@LukeButters it was only occurring when a bad plan was chosen
Prerequisites
The bug
An Octopus instance with many events can experience SQL timeouts when searching for events in particular ways, like by user, project, environment, etc.
Steps to reproduce
This requires an Octopus instance with many millions of events.
Screen capture
Affected versions
Octopus Server: 2021.1 onwards
Workarounds for on-premise customers
You need to clear the cached execution plans:
Workarounds for Cloud (Azure SQL) customers
There are a few things that can improve the performance of searching events:
EventRelatedDocument
tables, see example script belowSee detailed steps here.
Rebuild
EventRelatedDocument
indexesNote: this can take a long time to execute, depending on the number of events.
Links
Internal report: https://trello.com/c/faJFrpEl/37-sql-execution-timeout-on-deploy