dotnet / aspire

Tools, templates, and packages to accelerate building observable, production-ready apps
https://learn.microsoft.com/dotnet/aspire
MIT License
3.85k stars 462 forks source link

Stuck when waiting in EnvironmentCallbackAnnotation for a health check #6613

Open ealeykin opened 1 week ago

ealeykin commented 1 week ago

Is there an existing issue for this?

Describe the bug

I have an Aspire app running two hosts and three containers:

1. Kafka+UI
2. RabbitMQ+UI
3. SqlServer
4. DbUp host
5. Application host

The application host is waiting for DbUp completion and also Kafka/RabbitMq health checks, DbUp host waits for SqlServer health check only.

So the issue is with DbUp - it's not started at all sometimes, so the Aspire app is stuck. The output is as following:

Waiting for Kafka ...
KafkaReady
Waiting for RabbitMq ...
RabbitMq is ready
Waiting for SqlServer ...
SqlServer is ready

Now DbUp supposed to run, SqlServer is ready - but it doesn't. It's not 100% reproducible all the time, just every 4th or 5th run.

Since DbUp depends only on SqlServer health check, I'm posting here custom health check for SqlServer added to DbUp resource (but others are similar).

    public static IResourceBuilder<T> WaitForSqlServer<T>(
        this IResourceBuilder<T> builder, 
        IResourceBuilder<SqlServerServerResource> sqlServerBuilder) where T : IResource
    {
        return builder.WithAnnotation(new EnvironmentCallbackAnnotation(async context =>
        {
            var loggerFactory = context.ExecutionContext.ServiceProvider.GetRequiredService<ILoggerFactory>();
            var logger = loggerFactory.CreateLogger($"Aspire.{builder.Resource.Name}");

            if (sqlServerBuilder.Resource is IResourceWithConnectionString c)
            {
                var result = new HealthCheckResult();
                var url = await c.GetConnectionStringAsync();
                var check = new SqlServerHealthCheck(new SqlServerHealthCheckOptions
                {
                    ConnectionString = url!,
                });

                while (result.Status != HealthStatus.Healthy)
                {
                    logger.LogInformation("Waiting for SqlServer");
                    context.Logger?.LogInformation("Waiting for SqlServer");

                    var healthCheckContext = new HealthCheckContext
                    {
                        Registration = new HealthCheckRegistration("", check, HealthStatus.Unhealthy, [])
                    };

                    result = await check.CheckHealthAsync(healthCheckContext, context.CancellationToken);

                    if (result.Status != HealthStatus.Healthy)
                    {
                        await Task.Delay(1000, context.CancellationToken);
                    }
                }

                logger.LogInformation("SqlServer is ready");
                context.Logger?.LogInformation("SqlServer is ready");
            }
        }));

Expected Behavior

Once SqlServer health check is OK, environment callback annotation completes and DbUp resource transits to running state.

Steps To Reproduce

Aspire app:

...

var sqlServer = builder.AddSqlServer("SqlServer");
var db = sqlServer.AddDatabase("DB", "AspireDB");

var dbUp = builder
    .AddProject<Projects.Database_DbUp>("DbUP")
    .WithReference(db)
    .WaitForSqlServer(sqlServer);
...

Integration test with XUnit:

[Fact]
public async Task Consume_Event()
{
        var appHost = await DistributedApplicationTestingBuilder.CreateAsync<Projects.Aspire>();

        appHost.Services.AddLogging(builder => builder
            .AddFilter("Aspire.Resources.Kafka", LogLevel.Warning)
            .AddFilter("Aspire.Resources.SqlServer", LogLevel.Warning)
            .AddFilter("Aspire.Resources.RabbitMQ", LogLevel.Warning)
            .AddXUnit(outputHelper));

        var kafkaResource = appHost.Resources.First(x => x.Name == "Kafka") as IResourceWithConnectionString;
        var kafkaUiResource = appHost.Resources.First(x => x.Name == "Kafka-kafka-ui") as IResourceWithEndpoints;
        var dbResource = appHost.Resources.First(x => x.Name == "DB") as IResourceWithConnectionString;

        await using var app = await appHost.BuildAsync();

        outputHelper.WriteLine($"Starting");

        await app.StartAsync(); // never completes sometimes
}

Exceptions (if any)

Not consistent, sometimes it works, sometimes - no. Every 4th/5th run - it stucks.

.NET Version info

.NET SDK: Version: 8.0.300 Commit: 326f6e68b2 Workload version: 8.0.300-manifests.e0880c5d MSBuild version: 17.10.4+10fbfbf2e

Runtime Environment: OS Name: Mac OS X OS Version: 14.5 OS Platform: Darwin RID: osx-arm64 Base Path: /usr/local/share/dotnet/sdk/8.0.300/

.NET workloads installed: [aspire] Installation Source: SDK 8.0.300 Manifest Version: 8.2.2/8.0.100 Manifest Path: /usr/local/share/dotnet/sdk-manifests/8.0.100/microsoft.net.sdk.aspire/8.2.2/WorkloadManifest.json Install Type: FileBased

Host: Version: 8.0.5 Architecture: arm64 Commit: 087e15321b

.NET SDKs installed: 8.0.300 [/usr/local/share/dotnet/sdk]

.NET runtimes installed: Microsoft.AspNetCore.App 8.0.5 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App] Microsoft.NETCore.App 8.0.5 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]

Other architectures found: None

Environment variables: Not set

global.json file: Not found

Anything else?

No response

joperezr commented 6 days ago

[Triage] - Can you please provide more info on why you are using two hosts? Can you share more of what is the goal of your setup? It would also be helpful if you can share a minimal repro repository of the issue so we can better investigate.

ealeykin commented 6 days ago

@joperezr Goal of setup and why two hosts:

Will need more time to create repro repo.

davidfowl commented 6 days ago

I would suggest using the built in WaitFor support coming in the next version of Aspire instead of this version that you're showing us in the snippet above.

This code is racy, it assumes that the connection string will be available by the time this code runs which may not be the case. That might be why it works sometimes and breaks at other times.

ealeykin commented 6 days ago

@davidfowl I'm aware of upcoming changes and also yours previous implementation (and it also stuck sometimes), however:

  1. The connection string is available all the time (and if not - it will just retry, but it's not the case). Eventually the health check succeeds all the time in 100% cases, but DbUp host depending on SqlServer is not starting anyway.

  2. Even if I switch to the new WaitFor, that approach with EnvironmentCallbackAnnotation - I like it and would like to use it to pre-create some Kafka topics (for simplicity reasons) - any objections against using it like this ?

  3. The behavior is mostly observed in Integration tests with DistributedApplicationTestingBuilder.

The only thing comes to my mind - try to test the same approach with NET9. However EnvironmentCallbackAnnotation this is something fundamental and basic, shouldn't stuck anyway...

Thanks

davidfowl commented 6 days ago

Even if I switch to the new WaitFor, that approach with EnvironmentCallbackAnnotation - I like it and would like to use it to pre-create some Kafka topics (for simplicity reasons) - any objections against using it like this ?

Yes, you should use the built in feature that we designed and tested for waiting on resources. You would not use this to pre-create kafka topics though, you would use the new eventing system (to avoid abusing the environment variable callback).

The only thing comes to my mind - try to test the same approach with NET9. However EnvironmentCallbackAnnotation this is something fundamental and basic, shouldn't stuck anyway...

If you go with this approach, then you'll need to read the code and understand what might be happening. See where the hang is and try to figure it out. We don't have enough information to help you debug. It's difficult to tell from that code snippet why and what is hanging.