Xabaril / AspNetCore.Diagnostics.HealthChecks

Enterprise HealthChecks for ASP.NET Core Diagnostics Package
Apache License 2.0
4.05k stars 793 forks source link

MongoDB HealthCheck Intermittently getting CancellationToken #1137

Closed gcalabro-rli closed 1 year ago

gcalabro-rli commented 2 years ago

I am running into an intermittent Connection Closed issue resulting in a Cancellation Token.

We have an AKS Cluster with many pods. Each pod has a health check leveraging the AspNetCore Diagnostic HeathCheck pipeline.

We are using Mongo Atlas, hosted in Azure, connecting over a private endpoint.

For each pod's health check, we leverage the AspNetCore Diagnostic Health Check Package for MongoDB. This under the hood, connects to Mongo with a connection string pulled from IConfiguration and connects to the Database with the configuration settings provided by the connection string. It then does a simple query, asking the database to return a list of collections within the database. If this is successful it will return a Healthy Result back from the AspNetCore Diagnostics thus to the K8 probe.

For K8 probe here is the configuration:

livenessProbe:
   failureThreshold: 3
   httpGet:
     path: /Heartbeat
     port: 8080
     scheme: HTTP
   initialDelaySeconds: 3
   periodSeconds: 10
   successThreshold: 1
   timeoutSeconds: 15

Intermittently we are getting these errors wears the pod reports back an unhealthy connection, saying the connection has been closed. Below is the stack trace.

System.OperationCanceledException: The operation was canceled. at 
System.Threading.CancellationToken.ThrowOperationCanceledException() at
MongoDB.Driver.Core.Connections.BinaryConnection.ThrowIfCancelledOrDisposedOrNotOpen(CancellationToken cancellationToken) at
MongoDB.Driver.Core.Connections.BinaryConnection.SendMessagesAsync(IEnumerable`1 messages, MessageEncoderSettings messageEncoderSettings, CancellationToken cancellationToken) at
MongoDB.Driver.Core.WireProtocol.CommandUsingQueryMessageWireProtocol`1.ExecuteAsync(IConnection connection, CancellationToken cancellationToken) at
MongoDB.Driver.Core.Connections.HelloHelper.GetResultAsync(IConnection connection, CommandWireProtocol`1 helloProtocol, CancellationToken cancellationToken) at
MongoDB.Driver.Core.Connections.ConnectionInitializer.SendHelloAsync(IConnection connection, CancellationToken cancellationToken) at
MongoDB.Driver.Core.Connections.BinaryConnection.OpenHelperAsync(CancellationToken cancellationToken) at
MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool.PooledConnection.OpenAsync(CancellationToken cancellationToken) at
MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool.ConnectionCreator.CreateOpenedInternalAsync(CancellationToken cancellationToken) at
MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool.ConnectionCreator.CreateOpenedOrReuseAsync(CancellationToken cancellationToken) at
MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool.AcquireConnectionHelper.AcquireConnectionAsync(CancellationToken cancellationToken) at
MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool.AcquireConnectionAsync(CancellationToken cancellationToken) at
MongoDB.Driver.Core.Servers.Server.GetChannelAsync(CancellationToken cancellationToken) at
MongoDB.Driver.Core.Operations.RetryableReadContext.InitializeAsync(CancellationToken cancellationToken) at
MongoDB.Driver.Core.Operations.RetryableReadContext.CreateAsync(IReadBinding binding, Boolean retryRequested, CancellationToken cancellationToken) at
MongoDB.Driver.Core.Operations.ListCollectionsOperation.ExecuteAsync(IReadBinding binding, CancellationToken cancellationToken) at
MongoDB.Driver.OperationExecutor.ExecuteReadOperationAsync[TResult](IReadBinding binding, IReadOperation`1 operation, CancellationToken cancellationToken) at
MongoDB.Driver.MongoDatabaseImpl.ExecuteReadOperationAsync[T](IClientSessionHandle session, IReadOperation`1 operation, ReadPreference readPreference, CancellationToken cancellationToken) at
MongoDB.Driver.MongoDatabaseImpl.ListCollectionNamesAsync(IClientSessionHandle session, ListCollectionNamesOptions options, CancellationToken cancellationToken) at
MongoDB.Driver.MongoDatabaseImpl.UsingImplicitSessionAsync[TResult](Func`2 funcAsync, CancellationToken cancellationToken) at
HealthChecks.MongoDb.MongoDbHealthCheck.CheckHealthAsync(HealthCheckContext context, CancellationToken cancellationToken) in /_/src/HealthChecks.MongoDb/MongoDbHealthCheck.cs:line 57

I don't see any particular pattern or rhyme or reason why this is occurring especially because it's random. Also it happens when there is little to no traffic on the cluster either. So I have ruled out load.

MongoDB Atlas - Azure Hosted Kubernetes 1.22 - AKS NET 6.0

sungam3r commented 2 years ago

ASP NET Core calls DefaultHealthCheckService.CheckHealthAsync with cancellation token equals to httpContext.RequestAborted. My advice is to check server logs for details, increase log level to debug/verbose etc to diagnose.

MigTremblay commented 2 years ago

Same thing here. Things were running fine for months then all of a sudden not a single MongoDB healthcheck has passed since about 2 weeks. The application is able to access the database fine. The healthcheck however started reporting unhealthy status and made our pods restart few hundred times in production...

The healthcheck is configured with a 15 seconds timeout but the message tell us that it decided to cancel after just one second (???). There's no additional log and we have no idea what is happening. We see the healthcheck requests being executed without problem in the MongoDB logs.

We are going to implement a more reliable healthcheck and stop using this one but we would still appreciate if this eventually get fixed so we can switch back.

       services
                ...
                .AddHealthChecks()
                .AddMongoDb(
                    options.ConnectionString,
                    options.Database,
                    "X",
                    timeout: TimeSpan.FromSeconds(15));
Health check X with status Unhealthy completed after 1001.4778ms with message 'null'
System.OperationCanceledException: The operation was canceled.
   at MongoDB.Driver.Core.Misc.StreamExtensionMethods.ReadAsync(Stream stream, Byte[] buffer, Int32 offset, Int32 count, TimeSpan timeout, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Misc.StreamExtensionMethods.ReadBytesAsync(Stream stream, Byte[] buffer, Int32 offset, Int32 count, TimeSpan timeout, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Connections.BinaryConnection.ReceiveBufferAsync(CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Connections.BinaryConnection.ReceiveBufferAsync(Int32 responseTo, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Connections.BinaryConnection.Dropbox.RemoveMessage(Int32 responseTo)
   at MongoDB.Driver.Core.Connections.BinaryConnection.ReceiveBufferAsync(Int32 responseTo, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Connections.BinaryConnection.ReceiveMessageAsync(Int32 responseTo, IMessageEncoderSelector encoderSelector, MessageEncoderSettings messageEncoderSettings, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.WireProtocol.CommandUsingQueryMessageWireProtocol`1.ExecuteAsync(IConnection connection, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Connections.HelloHelper.GetResultAsync(IConnection connection, CommandWireProtocol`1 helloProtocol, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Connections.ConnectionInitializer.SendHelloAsync(IConnection connection, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Connections.BinaryConnection.OpenHelperAsync(CancellationToken cancellationToken)
   at MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool.PooledConnection.OpenAsync(CancellationToken cancellationToken)
   at MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool.ConnectionCreator.CreateOpenedInternalAsync(CancellationToken cancellationToken)
   at MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool.ConnectionCreator.CreateOpenedOrReuseAsync(CancellationToken cancellationToken)
   at MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool.AcquireConnectionHelper.AcquireConnectionAsync(CancellationToken cancellationToken)
   at MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool.AcquireConnectionAsync(CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Servers.Server.GetChannelAsync(CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Operations.RetryableReadContext.InitializeAsync(CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Operations.RetryableReadContext.CreateAsync(IReadBinding binding, Boolean retryRequested, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Operations.ListCollectionsOperation.ExecuteAsync(IReadBinding binding, CancellationToken cancellationToken)
   at MongoDB.Driver.OperationExecutor.ExecuteReadOperationAsync[TResult](IReadBinding binding, IReadOperation`1 operation, CancellationToken cancellationToken)
   at MongoDB.Driver.MongoDatabaseImpl.ExecuteReadOperationAsync[T](IClientSessionHandle session, IReadOperation`1 operation, ReadPreference readPreference, CancellationToken cancellationToken)
   at MongoDB.Driver.MongoDatabaseImpl.ListCollectionNamesAsync(IClientSessionHandle session, ListCollectionNamesOptions options, CancellationToken cancellationToken)
   at MongoDB.Driver.MongoDatabaseImpl.UsingImplicitSessionAsync[TResult](Func`2 funcAsync, CancellationToken cancellationToken)
   at HealthChecks.MongoDb.MongoDbHealthCheck.CheckHealthAsync(HealthCheckContext context, CancellationToken cancellationToken) in /_/src/HealthChecks.MongoDb/MongoDbHealthCheck.cs:line 57
sungam3r commented 2 years ago

after 1001.4778ms

My intuition tells me that this can be a different timout - 1 second timeout from kubernetes or something else. How do you deploy services? I faced the same problem with cancellation of health check requests due to too small timeout for probes in kubernetes. See MongoDbHealthCheck code, it's rather simple and I see no reason what can go wrong here with this code itself.

MigTremblay commented 2 years ago

1 second timeout from kubernetes

We've had issues in the past with healthchecks that fails on databases that has no collection yet but I think that I blamed the code too quickly this time. Seems like you found what we did wrong. We left the k8s healthcheck timeout to default which appear to be 1 second. In that case, writing our own wouldn't change a thing. I'll give more attention to the healthcheck settings in our k8s deploy and post back my results.

sungam3r commented 1 year ago

@gcalabro-rli Did you manage to solve this issue?

Thato9524 commented 1 year ago

still facing the same issue..is there any update on the fix

sungam3r commented 1 year ago

As @MigTremblay said this issue is most likely about configuring timeout on the caller side. Closing.