Particular / ServiceControl

Backend for ServiceInsight and ServicePulse
https://docs.particular.net/servicecontrol/
Other
51 stars 47 forks source link

Improve log messages when the instance is stopping and custom checks are cancelled #3519

Open ramonsmits opened 1 year ago

ramonsmits commented 1 year ago

Describe the bug

Description

Many TaskCanceledException but unclear why these are logged. Maybe this happens during shutdown but then these log entries should NOT be written and TaskCanceledException gracefully handled.

Expected behavior

Improved logging that shows WHY these log entries are canceled. What CancellationTokenSource is responsible for cancellation.

Actual behavior

Many TaskCanceledException but unclear why these are logged.

Versions

Unclear but likely 4.30.0

Please list the version of the relevant packages or applications in which the bug exists.

Steps to reproduce

Unknown

Relevant log output

2023-05-12 00:21:11.9470|70|Error|NServiceBus.CustomChecks.TimerBasedPeriodicCheck|'ServiceControl.CheckRavenDBIndexLag' implementation failed to run.
System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at Raven.Client.Http.RequestExecutor.<SendAsync>d__117`1.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Http\RequestExecutor.cs:line 1065
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Client.Http.RequestExecutor.<SendRequestToServer>d__116`1.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Http\RequestExecutor.cs:line 1009
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Client.Http.RequestExecutor.<ExecuteAsync>d__114`1.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Http\RequestExecutor.cs:line 866
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Client.Documents.Operations.MaintenanceOperationExecutor.<SendAsync>d__13`1.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Documents\Operations\MaintenanceOperationExecutor.cs:line 62
--- End of stack trace from previous location where exception was thrown ---
   at ServiceControl.CheckRavenDBIndexLag.<PerformCheck>d__1.MoveNext() in /_/src/ServiceControl.Audit.Persistence.RavenDb5/CustomChecks/CheckRavenDBIndexLag.cs:line 25
--- End of stack trace from previous location where exception was thrown ---
   at NServiceBus.CustomChecks.TimerBasedPeriodicCheck.<Run>d__4.MoveNext()
2023-05-12 01:40:52.9331|78|Warn|ServiceControl.Audit.Auditing.AuditPersister|Bulk insertion failed
System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at Raven.Client.Http.RequestExecutor.<SendAsync>d__117`1.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Http\RequestExecutor.cs:line 1065
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Client.Http.RequestExecutor.<SendRequestToServer>d__116`1.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Http\RequestExecutor.cs:line 1009
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Client.Http.RequestExecutor.<ExecuteAsync>d__114`1.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Http\RequestExecutor.cs:line 866
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Client.Documents.BulkInsert.BulkInsertOperation.<ExecuteAsync>d__60`1.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Documents\BulkInsert\BulkInsertOperation.cs:line 590
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Client.Documents.BulkInsert.BulkInsertOperation.<WaitForId>d__36.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Documents\BulkInsert\BulkInsertOperation.cs:line 311
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Client.Documents.BulkInsert.BulkInsertOperation.<ExecuteBeforeStore>d__49.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Documents\BulkInsert\BulkInsertOperation.cs:line 499
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Client.Documents.BulkInsert.BulkInsertOperation.<StoreAsync>d__41.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Documents\BulkInsert\BulkInsertOperation.cs:line 355
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Client.Documents.BulkInsert.BulkInsertOperation.<StoreAsync>d__40.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Documents\BulkInsert\BulkInsertOperation.cs:line 344
--- End of stack trace from previous location where exception was thrown ---
   at ServiceControl.Audit.Persistence.RavenDb.UnitOfWork.RavenDbAuditIngestionUnitOfWork.<RecordProcessedMessage>d__4.MoveNext() in /_/src/ServiceControl.Audit.Persistence.RavenDb5/UnitOfWork/RavenDbAuditIngestionUnitOfWork.cs:line 42
--- End of stack trace from previous location where exception was thrown ---
   at ServiceControl.Audit.Auditing.AuditPersister.<Persist>d__1.MoveNext() in /_/src/ServiceControl.Audit/Auditing/AuditPersister.cs:line 88
2023-05-12 01:40:52.9331|78|Info|ServiceControl.Audit.Auditing.AuditIngestion|Ingesting messages failed
System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at Raven.Client.Http.RequestExecutor.<SendAsync>d__117`1.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Http\RequestExecutor.cs:line 1065
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Client.Http.RequestExecutor.<SendRequestToServer>d__116`1.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Http\RequestExecutor.cs:line 1009
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Client.Http.RequestExecutor.<ExecuteAsync>d__114`1.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Http\RequestExecutor.cs:line 866
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Client.Documents.BulkInsert.BulkInsertOperation.<ExecuteAsync>d__60`1.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Documents\BulkInsert\BulkInsertOperation.cs:line 590
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Client.Documents.BulkInsert.BulkInsertOperation.<WaitForId>d__36.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Documents\BulkInsert\BulkInsertOperation.cs:line 311
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Client.Documents.BulkInsert.BulkInsertOperation.<ExecuteBeforeStore>d__49.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Documents\BulkInsert\BulkInsertOperation.cs:line 499
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Client.Documents.BulkInsert.BulkInsertOperation.<StoreAsync>d__41.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Documents\BulkInsert\BulkInsertOperation.cs:line 355
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Client.Documents.BulkInsert.BulkInsertOperation.<StoreAsync>d__40.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Documents\BulkInsert\BulkInsertOperation.cs:line 344
--- End of stack trace from previous location where exception was thrown ---
   at ServiceControl.Audit.Persistence.RavenDb.UnitOfWork.RavenDbAuditIngestionUnitOfWork.<RecordProcessedMessage>d__4.MoveNext() in /_/src/ServiceControl.Audit.Persistence.RavenDb5/UnitOfWork/RavenDbAuditIngestionUnitOfWork.cs:line 42
--- End of stack trace from previous location where exception was thrown ---
   at ServiceControl.Audit.Auditing.AuditPersister.<Persist>d__1.MoveNext() in /_/src/ServiceControl.Audit/Auditing/AuditPersister.cs:line 129
--- End of stack trace from previous location where exception was thrown ---
   at ServiceControl.Audit.Auditing.AuditPersister.<Persist>d__1.MoveNext() in /_/src/ServiceControl.Audit/Auditing/AuditPersister.cs:line 167
--- End of stack trace from previous location where exception was thrown ---
   at ServiceControl.Audit.Auditing.AuditIngestor.<Ingest>d__2.MoveNext() in /_/src/ServiceControl.Audit/Auditing/AuditIngestor.cs:line 61
--- End of stack trace from previous location where exception was thrown ---
   at ServiceControl.Audit.Auditing.AuditIngestion.<Loop>d__9.MoveNext() in /_/src/ServiceControl.Audit/Auditing/AuditIngestion.cs:line 199
2023-05-12 02:26:11.3073|85|Error|NServiceBus.CustomChecks.TimerBasedPeriodicCheck|'ServiceControl.Audit.Auditing.FailedAuditImportCustomCheck' implementation failed to run.
System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at Raven.Client.Http.RequestExecutor.<SendAsync>d__117`1.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Http\RequestExecutor.cs:line 1065
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Client.Http.RequestExecutor.<SendRequestToServer>d__116`1.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Http\RequestExecutor.cs:line 1009
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Client.Http.RequestExecutor.<ExecuteAsync>d__114`1.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Http\RequestExecutor.cs:line 866
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Client.Documents.Session.AsyncDocumentQuery`1.<ExecuteActualQueryAsync>d__142.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Documents\Session\AsyncDocumentQuery.cs:line 1082
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Client.Documents.Session.AsyncDocumentQuery`1.<InitAsync>d__141.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Documents\Session\AsyncDocumentQuery.cs:line 1072
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Client.Documents.Session.AsyncDocumentQuery`1.<GetQueryResultAsync>d__140.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Documents\Session\AsyncDocumentQuery.cs:line 1061
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Client.Documents.Session.AsyncDocumentQuery`1.<Raven-Client-Documents-Session-IAsyncDocumentQueryBase<T>-CountAsync>d__138.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Documents\Session\AsyncDocumentQuery.cs:line 1046
--- End of stack trace from previous location where exception was thrown ---
   at ServiceControl.Audit.Persistence.RavenDb.RavenDbFailedAuditStorage.<GetFailedAuditsCount>d__4.MoveNext() in /_/src/ServiceControl.Audit.Persistence.RavenDb5/RavenDbFailedAuditStorage.cs:line 84
--- End of stack trace from previous location where exception was thrown ---
   at ServiceControl.Audit.Auditing.FailedAuditImportCustomCheck.<PerformCheck>d__1.MoveNext() in /_/src/ServiceControl.Audit/Auditing/FailedAuditImportCustomCheck.cs:line 19
--- End of stack trace from previous location where exception was thrown ---
   at NServiceBus.CustomChecks.TimerBasedPeriodicCheck.<Run>d__4.MoveNext()
2023-05-12 12:26:12.7196|52|Error|NServiceBus.CustomChecks.TimerBasedPeriodicCheck|'ServiceControl.CheckRavenDBIndexLag' implementation failed to run.
System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at Raven.Client.Http.RequestExecutor.<SendAsync>d__117`1.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Http\RequestExecutor.cs:line 1065
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Client.Http.RequestExecutor.<SendRequestToServer>d__116`1.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Http\RequestExecutor.cs:line 1009
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Client.Http.RequestExecutor.<ExecuteAsync>d__114`1.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Http\RequestExecutor.cs:line 866
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Client.Documents.Operations.MaintenanceOperationExecutor.<SendAsync>d__13`1.MoveNext() in C:\Builds\RavenDB-Stable-5.4\54038\src\Raven.Client\Documents\Operations\MaintenanceOperationExecutor.cs:line 62
--- End of stack trace from previous location where exception was thrown ---
   at ServiceControl.CheckRavenDBIndexLag.<PerformCheck>d__1.MoveNext() in /_/src/ServiceControl.Audit.Persistence.RavenDb5/CustomChecks/CheckRavenDBIndexLag.cs:line 25
--- End of stack trace from previous location where exception was thrown ---
   at NServiceBus.CustomChecks.TimerBasedPeriodicCheck.<Run>d__4.MoveNext()

Additional Information

Workarounds

Possible solutions

Additional information

ramonsmits commented 1 year ago

When a shutdown is initialized after this the shutdown will timeout after 30 seconds

The message pump failed to stop with in the time allowed(30s)

2023-05-23 08:46:34.0837|14|Error|NServiceBus.CustomChecks.TimerBasedPeriodicCheck|'ServiceControl.CheckRavenDBIndexLag' implementation failed to run.
System.Threading.Tasks.TaskCanceledException: A task was canceled.
2023-05-23 08:54:53.2517|7|Warn|ServiceControl.Audit.Auditing.AuditPersister|Bulk insertion failed
System.Threading.Tasks.TaskCanceledException: A task was canceled.
2023-05-23 08:54:53.3611|7|Info|ServiceControl.Audit.Auditing.AuditIngestion|Ingesting messages failed
System.Threading.Tasks.TaskCanceledException: A task was canceled.
2023-05-23 09:31:34.1249|11|Error|NServiceBus.CustomChecks.TimerBasedPeriodicCheck|'ServiceControl.Audit.Auditing.FailedAuditImportCustomCheck' implementation failed to run.
System.Threading.Tasks.TaskCanceledException: A task was canceled.
2023-05-23 10:44:40.3732|10|Info|Microsoft.Hosting.Lifetime|Application is shutting down...
2023-05-23 10:44:40.3732|16|Info|ServiceControl.Audit.Auditing.AuditIngestion|Shutting down. Start/stop semaphore acquiring
2023-05-23 10:44:40.3732|16|Info|ServiceControl.Audit.Auditing.AuditIngestion|Shutting down. Start/stop semaphore acquired
2023-05-23 10:44:40.3732|16|Info|ServiceControl.Audit.Auditing.AuditIngestion|Shutting down. Infrastructure shut down commencing
2023-05-23 10:44:40.3732|16|Info|NServiceBus.Raw.RunningRawEndpointInstance|Stopping receiver.
2023-05-23 10:45:10.4110|16|Error|NServiceBus.Transport.Msmq.MessagePump|The message pump failed to stop with in the time allowed(30s)
2023-05-23 10:45:10.4110|16|Info|NServiceBus.Raw.RunningRawEndpointInstance|Receiver stopped.
2023-05-23 10:45:10.4110|16|Info|NServiceBus.Raw.StoppableRawEndpoint|Initiating shutdown.
2023-05-23 10:45:10.4110|16|Info|NServiceBus.Raw.StoppableRawEndpoint|Shutdown complete.
2023-05-23 10:45:10.4110|16|Info|ServiceControl.Audit.Auditing.AuditIngestion|Shutting down. Infrastructure shut down completed
2023-05-23 10:45:10.4110|16|Info|ServiceControl.Audit.Auditing.AuditIngestion|Shutting down. Start/stop semaphore releasing
2023-05-23 10:45:10.4110|16|Info|ServiceControl.Audit.Auditing.AuditIngestion|Shutting down. Start/stop semaphore released
jpalac commented 1 year ago

The audit instance uses a custom CustomCheckManager which was not handling OperationCancelled exceptions - this was rectified in https://github.com/Particular/ServiceControl/pull/3602.

The audit instance uses NServiceBus.CustomChecks v3. In that version, the custom check runner does not handle OperationCancelled exception hence we're seeing occasional errors for some of the audit instance custom checks when the instance is being shutdown during their run. In v4 of NServiceBus.CustomChecks the problem is fixed hence when we upgrade SC to use NSB 8 the main instance will be sorted as well.

ramonsmits commented 1 year ago

@jpalac @tmasternak so this issue can still happen in the log files of SC? Can't we just add a try catch in the current custom checks code itself then?

The issue is that otherwise the log still can have this log entries until we upgrade? Log shouldn't contain such entries so it feels as this was closed to fast as we don't know when SC will be upgraded to v8.

tmasternak commented 1 year ago

@ramonsmits This is not going to happen for the main instance (this has been fixed). With the audit instance, you are correct, that we would need to tweak each custom check there exists to prevent this from happening.

we don't know when SC will be upgraded to v8.

Our assumption was that we should do this when we provide support for contenerization.

ramonsmits commented 1 year ago

I'm reopening this so that either catching of OCE's is done in the custom tasks or when all instances have upgraded to Core v8.