
SQL timeouts when searching events, publishing or loading a runbook snapshot - Execution Timeout Expired #6972

Open adam-mccoy opened 3 years ago

adam-mccoy commented 3 years ago


The bug

An Octopus instance with many events can experience SQL timeouts when searching for events with certain filters, such as by user, project, or environment.

Steps to reproduce

This requires an Octopus instance with many millions of events.

  1. Go to Audit under Configuration
  2. Add filters for last 30 days and a user
  3. See error

Screen capture


Exception occurred while executing a reader for `SELECT *
FROM (
    SELECT *,
    ROW_NUMBER() OVER (ORDER BY [Occurred] DESC, [AutoId] DESC) AS RowNum
    FROM [dbo].[Event]
    WHERE ((([SpaceId] = 'Spaces-1') OR ([SpaceId] = 'Spaces-2') OR ([SpaceId] = 'Spaces-3')))
    AND ([Occurred] >= @from)
    AND ([Occurred] < @to)
    AND (([UserId] in (@userids_1) OR [Id] in (SELECT er.EventId from EventRelatedDocument er where er.RelatedDocumentId in (@userids_1))))
) ALIAS_GENERATED_1
WHERE ([RowNum] >= @_minrow)
AND ([RowNum] <= @_maxrow)
ORDER BY [RowNum]`
SQL Error -2 - Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
Unknown error 258

Affected versions

Octopus Server: 2021.1 onwards

Workarounds for on-premises customers

You need to clear the cached execution plans:

DBCC FREEPROCCACHE
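If flushing the entire plan cache is too heavy-handed, a more targeted variant is possible. The query below is only a sketch of my own (the LIKE patterns are assumptions about how the paged event query appears in the plan cache, not something from this issue): it finds the plan_handle for that query so only that plan needs to be evicted.

-- Sketch: locate the cached plan for the paged Event query, then evict just that plan.
-- Adjust the LIKE patterns to match what is actually in your plan cache.
SELECT cp.plan_handle, cp.usecounts, st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE st.text LIKE '%ROW_NUMBER() OVER (ORDER BY [[]Occurred] DESC%'
  AND st.text LIKE '%[[]dbo].[[]Event]%';

-- Evict only that plan (substitute the plan_handle value returned above):
-- DBCC FREEPROCCACHE (0x0600...);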

Workarounds for Cloud (Azure SQL) customers

There are a few things that can improve the performance of searching events:

See detailed steps here.

Rebuild EventRelatedDocument indexes

ALTER INDEX [IX_EventRelatedDocument_RelatedDocumentId] ON [dbo].[EventRelatedDocument] REBUILD;
ALTER INDEX [IX_EventRelatedDocument_EventId] ON [dbo].[EventRelatedDocument] REBUILD;

Note: this can take a long time to execute, depending on the number of events.
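Before committing to a long rebuild, it may help to check how fragmented those indexes actually are; this is a generic check of my own, not a step from the workaround doc. Where online rebuilds are supported (for example on Azure SQL), the commented variant avoids blocking writes to the table while the rebuild runs.

-- Check fragmentation of the EventRelatedDocument indexes first (generic sketch):
SELECT i.name AS index_name,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.EventRelatedDocument'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id AND i.index_id = ps.index_id;

-- Where ONLINE rebuilds are supported, this avoids blocking the table:
-- ALTER INDEX [IX_EventRelatedDocument_RelatedDocumentId] ON [dbo].[EventRelatedDocument] REBUILD WITH (ONLINE = ON);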

Links

Internal report: https://trello.com/c/faJFrpEl/37-sql-execution-timeout-on-deploy

N-lson commented 2 years ago

Customer reported experiencing this issue (internal ticket): https://octopus.zendesk.com/agent/tickets/80435

akirayamamoto commented 2 years ago

Another customer experiencing this issue: https://octopus.zendesk.com/agent/tickets/81992

Proof that SQL could have chosen a better existing index:

Specifying an index in a query is an anti-pattern, so we need to figure out what's causing SQL to make this poor decision.
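One way to check what the optimizer actually picked (my own suggestion, not something already in this thread) is to pull the cached plan XML for the event queries and inspect the index seek/scan operators:

-- Sketch: pull cached plan XML for queries touching EventRelatedDocument to see which index was chosen.
SELECT qs.execution_count,
       qs.total_elapsed_time / qs.execution_count AS avg_elapsed_us,
       qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
WHERE st.text LIKE '%EventRelatedDocument%RelatedDocumentId%';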

Another thing to optimize: the endpoint below generates 2 queries. https://capptivate.octopus.app/api/events?regarding=RunbookSnapshots-8567&spaces=Spaces-1&includeSystem=true

  1. Counts the number of events.
  2. Grabs the data.

We should consider just fetching data until no more rows come back, rather than counting up front. We should not need to count at all; that would cut this to a single query instead of two, and COUNT is not cheap. A sketch of that idea follows the query list below.

List of queries from a single endpoint request:

-- this is taking 2s
exec sp_executesql N'SELECT COUNT(*)
FROM [dbo].[Event]
WHERE (((([SpaceId] in (''Spaces-1''))) OR (([SpaceId] is null))))
AND ([Id] in (SELECT er.EventId from EventRelatedDocument er where er.RelatedDocumentId = @regarding0))',N'@regarding0 nvarchar(100)',@regarding0=N'RunbookSnapshots-8567'
exec sp_executesql N'SELECT *
FROM (
    SELECT *,
    ROW_NUMBER() OVER (ORDER BY [Occurred] DESC, [AutoId] DESC) AS RowNum
    FROM [dbo].[Event]
    WHERE (((([SpaceId] in (''Spaces-1''))) OR (([SpaceId] is null))))
    AND ([Id] in (SELECT er.EventId from EventRelatedDocument er where er.RelatedDocumentId = @regarding0))
) ALIAS_GENERATED_1
WHERE ([RowNum] >= @_minrow)
AND ([RowNum] <= @_maxrow)
ORDER BY [RowNum]',N'@regarding0 nvarchar(100),@_minrow int,@_maxrow int',@regarding0=N'RunbookSnapshots-8567',@_minrow=1,@_maxrow=30
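Those two queries could, in principle, be collapsed into something like the sketch below. This is my assumption about how a no-COUNT approach could look, not the fix that shipped: request one row beyond the page size and treat the presence of that extra row as "there is another page".

DECLARE @regarding0 nvarchar(100) = N'RunbookSnapshots-8567';
DECLARE @pageSize int = 30;

-- Fetch pageSize + 1 rows; the caller keeps the first @pageSize and uses the extra row
-- only as a "more data exists" flag, so no SELECT COUNT(*) is needed.
SELECT TOP (@pageSize + 1) e.*
FROM [dbo].[Event] e
WHERE (e.[SpaceId] IN ('Spaces-1') OR e.[SpaceId] IS NULL)
  AND e.[Id] IN (SELECT er.EventId
                 FROM EventRelatedDocument er
                 WHERE er.RelatedDocumentId = @regarding0)
ORDER BY e.[Occurred] DESC, e.[AutoId] DESC;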
octoreleasebot commented 2 years ago

Release Note: Fixed SQL timeout on the api/events endpoint.

Octobob commented 2 years ago

:tada: The fix for this issue has been released in:

Release stream | Release
-- | --
2021.3 | 2021.3.12042
2022.1 | 2022.1.359

akirayamamoto commented 2 years ago

We are still experiencing this issue (internal): https://octopusdeploy.slack.com/archives/C01HZFJRYSH/p1643844703973029

zentron commented 2 years ago

Additional fixes have been supplied for the 2022.1 version from 2022.1.1286 onwards. These were not applied to the 2021.3 range.

Release stream | Release
-- | --
2022.2 | 2022.2.0
2022.1 | 2022.1.1286

akirayamamoto commented 2 years ago

Hi @zentron. I would like to keep this issue open for now since my last PR has a feature toggle that is not available to customers yet.

For now please rebuild the indexes as described in the workaround section.

donnybell commented 2 years ago

Another affected customer according to this internal thread

akirayamamoto commented 2 years ago

It might be worth trying Automatic plan correction.
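For reference, enabling it is a one-liner on Azure SQL or SQL Server 2017+ (this is the standard engine feature, not an Octopus-specific change), and detected plan regressions are reported in a DMV:

-- Enable automatic plan correction so the engine forces the last known good plan after a regression:
ALTER DATABASE CURRENT
SET AUTOMATIC_TUNING (FORCE_LAST_GOOD_PLAN = ON);

-- Detected plan-choice regressions (and corrections) are reported here:
SELECT reason, score, details
FROM sys.dm_db_tuning_recommendations;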

benPearce1 commented 1 year ago

The affected indexes listed above may differ depending on the version. In 2023.x versions, the IX_EventRelatedDocument_RelatedDocumentIdPrefix index may also be affected, and the IX_EventRelatedDocument_EventId index no longer exists.
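A quick way to confirm which of those indexes exist on your version before running the rebuild commands (a generic check, not part of the official workaround):

SELECT name, type_desc
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.EventRelatedDocument')
  AND name IS NOT NULL
ORDER BY name;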

mildmanneredjon commented 7 months ago

We are experiencing the same timeouts when loading a runbook snapshot (running Octopus version 2023.2, Build 12998). I don't think this is an index issue. It is caused by the way the select statement is written and by how the SQL Server engine changes the execution plan when data volumes increase significantly. When executed directly, the select statement now takes over 1 minute to complete. By materializing the EventRelatedDocument lookup first (in either a temp table or a table variable) and joining to that, the select statement runs in 110 milliseconds or less (the temp table version takes 50ms, the table variable version 110ms). e.g.

exec sp_executesql N'DECLARE @EventRelatedDocument TABLE (EventId NVARCHAR(50), RelatedDocumentId NVARCHAR(250))

INSERT INTO @EventRelatedDocument SELECT EventId, RelatedDocumentId from EventRelatedDocument where RelatedDocumentId = @regarding0

SELECT *
FROM (
    SELECT E.Id,RelatedDocumentIds,ProjectId,EnvironmentId,Category,UserId,Username,Occurred,Message,TenantId,AutoId,DataVersion,UserAgent,SpaceId,JSONBlob,ChangeDetails,IpAddress,JSON,
    ROW_NUMBER() OVER (ORDER BY [Occurred] DESC, [AutoId] DESC) AS RowNum
    FROM [dbo].[Event] E
    JOIN @EventRelatedDocument ER ON E.Id = ER.EventId
    WHERE (((([SpaceId] in (''Spaces-1''))) OR (([SpaceId] is null))))
    AND E.Id = ER.EventId
) ALIAS_GENERATED_1
WHERE ([RowNum] >= @_minrow)
AND ([RowNum] <= @_maxrow)
ORDER BY [RowNum];',N'@regarding0 nvarchar(100),@_minrow int,@_maxrow int',@regarding0=N'RunbookSnapshots-5650',@_minrow=1,@_maxrow=30;

mildmanneredjon commented 6 months ago

Further analysis and testing have produced another solution: eliminate the key lookups caused by the index not having all the required columns in its INCLUDE list. The way to do this is to join the Event table to itself and select the columns from the joined copy. Tests have shown this resolves the long-running queries against runbook snapshots with large volumes of data.

exec sp_executesql N'SELECT *
FROM (
    SELECT E2.Id,E2.RelatedDocumentIds,E2.ProjectId,E2.EnvironmentId,E2.Category,E2.UserId,E2.Username,E2.Occurred,E2.Message,E2.TenantId,E2.AutoId,E2.DataVersion,E2.UserAgent,E2.SpaceId,E2.JSONBlob,E2.ChangeDetails,E2.IpAddress,E2.JSON,
    ROW_NUMBER() OVER (ORDER BY E2.[Occurred] DESC, E2.[AutoId] DESC) AS RowNum
    FROM [dbo].[Event] E
    JOIN [dbo].[Event] E2 ON E.Id = E2.Id
    WHERE ((((E.[SpaceId] in (''Spaces-1''))) OR ((E.[SpaceId] is null))))
    AND (E.[Id] in (SELECT er.EventId from EventRelatedDocument er where er.RelatedDocumentId = @regarding0))
) ALIAS_GENERATED_1
WHERE ([RowNum] >= @_minrow)
AND ([RowNum] <= @_maxrow)
ORDER BY [RowNum]',N'@regarding0 nvarchar(100),@_minrow int,@_maxrow int',@regarding0=N'RunbookSnapshots-5650',@_minrow=1,@_maxrow=30;

LukeButters commented 5 months ago

@mildmanneredjon when the query was taking more than 1 minute, was that happening consistently, or only when MSSQL happened to choose a bad plan?

mildmanneredjon commented 5 months ago

@LukeButters it was only occurring when a bad plan was chosen.
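Given that the slowdown only shows up when a bad plan is chosen, another generic mitigation (my suggestion, not something from this thread) is to pin a known-good plan for the event query via Query Store:

-- Find the query and its plans in Query Store (adjust the LIKE pattern as needed):
SELECT q.query_id, p.plan_id, rs.avg_duration
FROM sys.query_store_query_text AS qt
JOIN sys.query_store_query AS q ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan AS p ON p.query_id = q.query_id
JOIN sys.query_store_runtime_stats AS rs ON rs.plan_id = p.plan_id
WHERE qt.query_sql_text LIKE '%EventRelatedDocument%';

-- Force the known-good plan (the ids below are placeholders taken from the query above):
-- EXEC sp_query_store_force_plan @query_id = 42, @plan_id = 7;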