
SQL timeouts when searching events, publishing or loading a runbook snapshot - Execution Timeout Expired #6972

Open adam-mccoy opened 3 years ago

adam-mccoy commented 3 years ago


The bug

An Octopus instance with many events can experience SQL timeouts when searching for events with certain filters, such as by user, project, or environment.

Steps to reproduce

This requires an Octopus instance with many millions of events.

  1. Go to Audit under Configuration
  2. Add filters for last 30 days and a user
  3. See error

Screen capture


Exception occurred while executing a reader for `SELECT *
FROM (
    SELECT *,
    ROW_NUMBER() OVER (ORDER BY [Occurred] DESC, [AutoId] DESC) AS RowNum
    FROM [dbo].[Event]
    WHERE ((([SpaceId] = 'Spaces-1') OR ([SpaceId] = 'Spaces-2') OR ([SpaceId] = 'Spaces-3')))
    AND ([Occurred] >= @from)
    AND ([Occurred] < @to)
    AND (([UserId] in (@userids_1) OR [Id] in (SELECT er.EventId from EventRelatedDocument er where er.RelatedDocumentId in (@userids_1))))
) ALIAS_GENERATED_1
WHERE ([RowNum] >= @_minrow)
AND ([RowNum] <= @_maxrow)
ORDER BY [RowNum]`
SQL Error -2 - Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
Unknown error 258

Affected versions

Octopus Server: 2021.1 onwards

Workarounds for on-premises customers

You need to clear the cached execution plans:

DBCC FREEPROCCACHE
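If flushing the entire plan cache is too heavy-handed, a more targeted variant is possible. The query below is only a sketch of my own (the LIKE patterns are assumptions about how the paged event query appears in the plan cache, not something from this issue): it finds the plan_handle for that query so only that plan needs to be evicted.

-- Sketch: locate the cached plan for the paged Event query, then evict just that plan.
-- Adjust the LIKE patterns to match what is actually in your plan cache.
SELECT cp.plan_handle, cp.usecounts, st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE st.text LIKE '%ROW_NUMBER() OVER (ORDER BY [[]Occurred] DESC%'
  AND st.text LIKE '%[[]dbo].[[]Event]%';

-- Evict only that plan (substitute the plan_handle value returned above):
-- DBCC FREEPROCCACHE (0x0600...);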

Workarounds for Cloud (Azure SQL) customers

There are a few things that can improve the performance of searching events:

See detailed steps here.

Rebuild EventRelatedDocument indexes

ALTER INDEX [IX_EventRelatedDocument_RelatedDocumentId] ON [dbo].[EventRelatedDocument] REBUILD;
ALTER INDEX [IX_EventRelatedDocument_EventId] ON [dbo].[EventRelatedDocument] REBUILD;

Note: this can take a long time to execute, depending on the number of events.
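Before committing to a long rebuild, it may help to check how fragmented those indexes actually are; this is a generic check of my own, not a step from the workaround doc. Where online rebuilds are supported (for example on Azure SQL), the commented variant avoids blocking writes to the table while the rebuild runs.

-- Check fragmentation of the EventRelatedDocument indexes first (generic sketch):
SELECT i.name AS index_name,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.EventRelatedDocument'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id AND i.index_id = ps.index_id;

-- Where ONLINE rebuilds are supported, this avoids blocking the table:
-- ALTER INDEX [IX_EventRelatedDocument_RelatedDocumentId] ON [dbo].[EventRelatedDocument] REBUILD WITH (ONLINE = ON);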

Links

Internal report: https://trello.com/c/faJFrpEl/37-sql-execution-timeout-on-deploy

N-lson commented 2 years ago

Customer reported experiencing this issue (internal ticket): https://octopus.zendesk.com/agent/tickets/80435

akirayamamoto commented 2 years ago

Another customer experiencing this issue: https://octopus.zendesk.com/agent/tickets/81992

Proof that SQL could have chosen a better existing index:

Specifying an index in a query is an anti-pattern, so we need to figure out what's causing SQL to make this poor decision.
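One way to check what the optimizer actually picked (my own suggestion, not something already in this thread) is to pull the cached plan XML for the event queries and inspect the index seek/scan operators:

-- Sketch: pull cached plan XML for queries touching EventRelatedDocument to see which index was chosen.
SELECT qs.execution_count,
       qs.total_elapsed_time / qs.execution_count AS avg_elapsed_us,
       qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
WHERE st.text LIKE '%EventRelatedDocument%RelatedDocumentId%';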

Another thing to optimize: the endpoint below generates 2 queries. https://capptivate.octopus.app/api/events?regarding=RunbookSnapshots-8567&spaces=Spaces-1&includeSystem=true

  1. Counts the number of events.
  2. Grabs the data.

We should consider just fetching data until no more rows come back, rather than counting up front. We should not need to count at all; that would cut this to a single query instead of two, and COUNT is not cheap. A sketch of that idea follows the query list below.

List of queries from a single endpoint request:

-- this is taking 2s
exec sp_executesql N'SELECT COUNT(*)
FROM [dbo].[Event]
WHERE (((([SpaceId] in (''Spaces-1''))) OR (([SpaceId] is null))))
AND ([Id] in (SELECT er.EventId from EventRelatedDocument er where er.RelatedDocumentId = @regarding0))',N'@regarding0 nvarchar(100)',@regarding0=N'RunbookSnapshots-8567'
exec sp_executesql N'SELECT *
FROM (
    SELECT *,
    ROW_NUMBER() OVER (ORDER BY [Occurred] DESC, [AutoId] DESC) AS RowNum
    FROM [dbo].[Event]
    WHERE (((([SpaceId] in (''Spaces-1''))) OR (([SpaceId] is null))))
    AND ([Id] in (SELECT er.EventId from EventRelatedDocument er where er.RelatedDocumentId = @regarding0))
) ALIAS_GENERATED_1
WHERE ([RowNum] >= @_minrow)
AND ([RowNum] <= @_maxrow)
ORDER BY [RowNum]',N'@regarding0 nvarchar(100),@_minrow int,@_maxrow int',@regarding0=N'RunbookSnapshots-8567',@_minrow=1,@_maxrow=30
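Those two queries could, in principle, be collapsed into something like the sketch below. This is my assumption about how a no-COUNT approach could look, not the fix that shipped: request one row beyond the page size and treat the presence of that extra row as "there is another page".

DECLARE @regarding0 nvarchar(100) = N'RunbookSnapshots-8567';
DECLARE @pageSize int = 30;

-- Fetch pageSize + 1 rows; the caller keeps the first @pageSize and uses the extra row
-- only as a "more data exists" flag, so no SELECT COUNT(*) is needed.
SELECT TOP (@pageSize + 1) e.*
FROM [dbo].[Event] e
WHERE (e.[SpaceId] IN ('Spaces-1') OR e.[SpaceId] IS NULL)
  AND e.[Id] IN (SELECT er.EventId
                 FROM EventRelatedDocument er
                 WHERE er.RelatedDocumentId = @regarding0)
ORDER BY e.[Occurred] DESC, e.[AutoId] DESC;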
octoreleasebot commented 2 years ago

Release Note: Fixed SQL timeout on the api/events endpoint.

Octobob commented 2 years ago

:tada: The fix for this issue has been released in:

Release stream | Release
-- | --
2021.3 | 2021.3.12042
2022.1 | 2022.1.359

akirayamamoto commented 2 years ago

We are still experiencing this issue (internal): https://octopusdeploy.slack.com/archives/C01HZFJRYSH/p1643844703973029

zentron commented 2 years ago

Additional fixes have been supplied for the 2022.1 version from 2022.1.1286 onwards. These were not applied to the 2021.3 range.

Release stream | Release
-- | --
2022.2 | 2022.2.0
2022.1 | 2022.1.1286

akirayamamoto commented 2 years ago

Hi @zentron. I would like to keep this issue open for now since my last PR has a feature toggle that is not available to customers yet.

For now please rebuild the indexes as described in the workaround section.

donnybell commented 2 years ago

Another affected customer according to this internal thread

akirayamamoto commented 2 years ago

It might be worth trying Automatic plan correction.
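For reference, enabling it is a one-liner on Azure SQL or SQL Server 2017+ (this is the standard engine feature, not an Octopus-specific change), and detected plan regressions are reported in a DMV:

-- Enable automatic plan correction so the engine forces the last known good plan after a regression:
ALTER DATABASE CURRENT
SET AUTOMATIC_TUNING (FORCE_LAST_GOOD_PLAN = ON);

-- Detected plan-choice regressions (and corrections) are reported here:
SELECT reason, score, details
FROM sys.dm_db_tuning_recommendations;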

benPearce1 commented 1 year ago

The affected indexes listed above may differ depending on the version. In 2023.x versions, the IX_EventRelatedDocument_RelatedDocumentIdPrefix index may also be affected, and the IX_EventRelatedDocument_EventId index no longer exists.
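A quick way to confirm which of those indexes exist on your version before running the rebuild commands (a generic check, not part of the official workaround):

SELECT name, type_desc
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.EventRelatedDocument')
  AND name IS NOT NULL
ORDER BY name;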

mildmanneredjon commented 7 months ago

We are experiencing the same timeouts when loading a runbook snapshot (running Octopus version 2023.2, Build 12998). I don't think this is an index issue. It is caused by the way the select statement is written and by how the SQL Server engine changes the execution plan when data volumes increase significantly. When executed directly, the select statement now takes over 1 minute to complete. By materializing the EventRelatedDocument lookup first (in either a temp table or a table variable) and joining to that, the select statement runs in 110 milliseconds or less (the temp table version takes 50ms, the table variable version 110ms). e.g.

exec sp_executesql N'DECLARE @EventRelatedDocument TABLE (EventId NVARCHAR(50), RelatedDocumentId NVARCHAR(250))

INSERT INTO @EventRelatedDocument SELECT EventId, RelatedDocumentId from EventRelatedDocument where RelatedDocumentId = @regarding0

SELECT *
FROM (
    SELECT E.Id,RelatedDocumentIds,ProjectId,EnvironmentId,Category,UserId,Username,Occurred,Message,TenantId,AutoId,DataVersion,UserAgent,SpaceId,JSONBlob,ChangeDetails,IpAddress,JSON,
    ROW_NUMBER() OVER (ORDER BY [Occurred] DESC, [AutoId] DESC) AS RowNum
    FROM [dbo].[Event] E
    JOIN @EventRelatedDocument ER ON E.Id = ER.EventId
    WHERE (((([SpaceId] in (''Spaces-1''))) OR (([SpaceId] is null))))
    AND E.Id = ER.EventId
) ALIAS_GENERATED_1
WHERE ([RowNum] >= @_minrow)
AND ([RowNum] <= @_maxrow)
ORDER BY [RowNum];',N'@regarding0 nvarchar(100),@_minrow int,@_maxrow int',@regarding0=N'RunbookSnapshots-5650',@_minrow=1,@_maxrow=30;

mildmanneredjon commented 6 months ago

Further analysis and testing have produced another solution: eliminate the key lookups caused by the index not having all the required columns in its INCLUDE list. The way to do this is to join the Event table to itself and select the columns from the joined copy. Tests have shown this resolves the long-running queries against runbook snapshots with large volumes of data.

exec sp_executesql N'SELECT *
FROM (
    SELECT E2.Id,E2.RelatedDocumentIds,E2.ProjectId,E2.EnvironmentId,E2.Category,E2.UserId,E2.Username,E2.Occurred,E2.Message,E2.TenantId,E2.AutoId,E2.DataVersion,E2.UserAgent,E2.SpaceId,E2.JSONBlob,E2.ChangeDetails,E2.IpAddress,E2.JSON,
    ROW_NUMBER() OVER (ORDER BY E2.[Occurred] DESC, E2.[AutoId] DESC) AS RowNum
    FROM [dbo].[Event] E
    JOIN [dbo].[Event] E2 ON E.Id = E2.Id
    WHERE ((((E.[SpaceId] in (''Spaces-1''))) OR ((E.[SpaceId] is null))))
    AND (E.[Id] in (SELECT er.EventId from EventRelatedDocument er where er.RelatedDocumentId = @regarding0))
) ALIAS_GENERATED_1
WHERE ([RowNum] >= @_minrow)
AND ([RowNum] <= @_maxrow)
ORDER BY [RowNum]',N'@regarding0 nvarchar(100),@_minrow int,@_maxrow int',@regarding0=N'RunbookSnapshots-5650',@_minrow=1,@_maxrow=30;

LukeButters commented 5 months ago

@mildmanneredjon when the query was taking more than 1 minute, was that happening consistently, or only when MSSQL happened to choose a bad plan?

mildmanneredjon commented 5 months ago

@LukeButters it was only occurring when a bad plan was chosen.
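Given that the slowdown only shows up when a bad plan is chosen, another generic mitigation (my suggestion, not something from this thread) is to pin a known-good plan for the event query via Query Store:

-- Find the query and its plans in Query Store (adjust the LIKE pattern as needed):
SELECT q.query_id, p.plan_id, rs.avg_duration
FROM sys.query_store_query_text AS qt
JOIN sys.query_store_query AS q ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan AS p ON p.query_id = q.query_id
JOIN sys.query_store_runtime_stats AS rs ON rs.plan_id = p.plan_id
WHERE qt.query_sql_text LIKE '%EventRelatedDocument%';

-- Force the known-good plan (the ids below are placeholders taken from the query above):
-- EXEC sp_query_store_force_plan @query_id = 42, @plan_id = 7;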