dotnet / aspire

An opinionated, cloud ready stack for building observable, production ready, distributed applications in .NET
https://learn.microsoft.com/dotnet/aspire
MIT License
3.66k stars 418 forks

Support restart on fail for executables and containers #1708

Open bit365 opened 8 months ago

bit365 commented 8 months ago

When container A starts, the service inside it needs some time to become ready. Meanwhile, service B has already started and depends heavily on the service in container A, so service B exits with an exception. One example is a Dapr sidecar whose subscription depends on a RabbitMQ queue.

builder.AddRabbitMQContainer("rabbitmq")
    .WithVolumeMount("./configs/rabbitmq/enabled_plugins.conf", "/etc/rabbitmq/enabled_plugins")
    .WithVolumeMount("./configs/rabbitmq/rabbitmq.conf", "/etc/rabbitmq/rabbitmq.conf")
    .WithEndpoint(1883, 1883)
    .WithEndpoint(15672, 15672)
    .WithEndpoint(5672, 5672)
    .WithEnvironment("TZ", "Asia/Shanghai");

var apiService = builder.AddProject<Projects.AspireApp_ApiService>("apiservice");
var apiServiceDapr = apiService.WithDaprSidecar(sidecarBuilder =>
{
    sidecarBuilder.WithOptions(new DaprSidecarOptions
    {
        AppId = "apiservice",
        ResourcesPaths = ["./components"],
        LogLevel = "debug"
    })
    .WithEnvironment("AMQP_CONNECTION_STRING", "amqp://test:hellotest@localhost:5672")
    .WithEnvironment("MQTT_CONNECTION_STRING", "tcp://test:hellotest@localhost:1883");
});

In the code above, the container has started, but RabbitMQ inside the container has not finished starting. By then the Dapr sidecar has already started and needs the RabbitMQ service. Because the service cannot be found, the Dapr sidecar stops working, and there is no restart strategy I can set for it. So although RabbitMQ eventually starts successfully, Dapr stays dead.

I wish there were an option to automatically restart after a failure; that would solve the problem.

A container can be restarted after failure through its startup parameters, but I don't want to containerize the Dapr sidecar, so no restart option is available for it.

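// Proposed API, does not exist today:
sidecarBuilder.WithRestartAlways();   // always restart after a failure
sidecarBuilder.WithRestartAlways(10); // restart after waiting 10 seconds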

Hopefully something like this can be supported.
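To make the request concrete, here is a minimal sketch of the behavior being asked for: run something, and if it exits with a failure code, wait and start it again a limited number of times. Everything here is an assumption for illustration. `RestartOnFailure`, `RunAsync`, `runOnce`, `maxRestarts`, and `restartDelay` are hypothetical names and are not part of .NET Aspire or Dapr.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch; none of these names exist in .NET Aspire.
public static class RestartOnFailure
{
    // Runs runOnce() and, while it returns a non-zero exit code, waits
    // restartDelay and runs it again, for at most maxRestarts extra runs.
    // Returns the last exit code observed.
    public static async Task<int> RunAsync(
        Func<Task<int>> runOnce,
        int maxRestarts,
        TimeSpan restartDelay,
        CancellationToken cancellationToken = default)
    {
        int exitCode = await runOnce();
        for (int restart = 0; exitCode != 0 && restart < maxRestarts; restart++)
        {
            // Give the dependency (for example RabbitMQ) time to come up.
            await Task.Delay(restartDelay, cancellationToken);
            exitCode = await runOnce();
        }
        return exitCode;
    }
}
```

For an executable, `runOnce` could start a `System.Diagnostics.Process`, await its exit, and return its `ExitCode`. A restart delay of 10 seconds would match the `WithRestartAlways(10)` idea above.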

davidfowl commented 8 months ago

@karolz-ms we discussed this a while ago. Since we don't have startup dependencies, we need a way to restart on fail when containers need to connect to each other, and we are unable to control retries inside of that container.

karolz-ms commented 8 months ago

Yep, we should be able to address it sometime before GA.

ohroy commented 7 months ago

+1

karolz-ms commented 7 months ago

> @karolz-ms we discussed this a while ago. Since we don't have startup dependencies, we need a way to restart on fail when containers need to connect to each other, and we are unable to control retries inside of that container.

We have also seen cases where a container starts successfully and fairly quickly, but then takes 10+ seconds to actually start responding to requests. Startup dependencies would not help in that case.

The best solution is to make the client robust (retry connections with exponential backoff). The second best is to retry the failed containers/executables a few times before giving up, which is what this feature is about.
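As an illustration of the "robust client" option, a minimal sketch of retrying a connection with exponential backoff might look like the following. The `Retry` class and its members are hypothetical helpers written for this example. They are not an Aspire, Dapr, or RabbitMQ API, and a production client would more likely rely on an existing resilience library.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of .NET Aspire or Dapr.
public static class Retry
{
    // Delay before retry number `attempt` (0-based):
    // baseDelay * 2^attempt, capped at maxDelay.
    public static TimeSpan BackoffDelay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        double ms = baseDelay.TotalMilliseconds * Math.Pow(2, attempt);
        return TimeSpan.FromMilliseconds(Math.Min(ms, maxDelay.TotalMilliseconds));
    }

    // Calls connect() until it succeeds or maxAttempts is reached.
    // The failure from the final attempt propagates to the caller.
    public static async Task<T> WithBackoffAsync<T>(
        Func<Task<T>> connect,
        int maxAttempts,
        TimeSpan baseDelay,
        TimeSpan maxDelay,
        CancellationToken cancellationToken = default)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        for (int attempt = 0; ; attempt++)
        {
            try
            {
                return await connect();
            }
            catch (Exception ex) when (attempt < maxAttempts - 1
                                       && ex is not OperationCanceledException)
            {
                // Not the last attempt: wait, then try again.
                await Task.Delay(BackoffDelay(attempt, baseDelay, maxDelay), cancellationToken);
            }
        }
    }
}
```

With a base delay of 100 ms, the waits grow as 100 ms, 200 ms, 400 ms, and so on up to the cap, which covers the case where a container is running but needs 10+ seconds before it answers requests.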

bit365 commented 7 months ago

In fact, there are several ways to solve this problem, and the design needs some thought.

  1. Keep retrying the restart (with a maximum number of attempts, exponential backoff, and a retry interval).
  2. Use the health endpoint provided by the service in the container to monitor its health status.
  3. Start resources in sequence based on declared dependencies.

etc.
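The health endpoint option could be sketched roughly as follows: poll a probe until it reports healthy, and only then start the dependent resource. This is only an illustration under stated assumptions. `HealthGate`, `WaitUntilHealthyAsync`, and `HttpProbe` are hypothetical names, and the assumption that "healthy" means an HTTP 2xx response depends on the service being probed.

```csharp
using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch; not an existing .NET Aspire API.
public static class HealthGate
{
    // Polls probe() every `interval` until it returns true or `timeout`
    // would be exceeded. Returns true if the service became healthy.
    public static async Task<bool> WaitUntilHealthyAsync(
        Func<Task<bool>> probe,
        TimeSpan interval,
        TimeSpan timeout,
        CancellationToken cancellationToken = default)
    {
        var elapsed = Stopwatch.StartNew();
        while (true)
        {
            bool healthy;
            try
            {
                healthy = await probe();
            }
            catch (Exception ex) when (ex is not OperationCanceledException)
            {
                // A failed probe (for example a refused connection) means "not ready yet".
                healthy = false;
            }

            if (healthy) return true;
            if (elapsed.Elapsed + interval > timeout) return false;
            await Task.Delay(interval, cancellationToken);
        }
    }

    // Assumed convention: the service is healthy when `url` returns a 2xx status.
    public static Func<Task<bool>> HttpProbe(HttpClient client, string url) =>
        async () =>
        {
            using var response = await client.GetAsync(url);
            return response.IsSuccessStatusCode;
        };
}
```

A gate like this handles the case where the container process is up but the service inside is not yet responding, which plain start ordering (option 3) would miss.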

dbreshears commented 6 months ago

We need to push this out to post-GA, as it should rely on container replica sets, which won't be implemented prior to GA.