dapr / dotnet-sdk

Dapr SDK for .NET
Apache License 2.0

Dapr.DaprApiException: error invoke actor method: error finding address for actor #1011

Closed jsmith-kno2 closed 1 year ago

jsmith-kno2 commented 1 year ago

Overview

When running a Dapr-enabled app via Docker Compose for the first time, the actor proxy instance returned by IActorProxyFactory throws the following exception when any method is called on that actor:

Dapr.DaprApiException: error invoke actor method: error finding address for actor type TerminalMessagesActor with id FailTerminalMessagesAsync
  at Dapr.Actors.DaprHttpInteractor.SendAsyncHandleUnsuccessfulResponse(Func`1 requestFunc, String relativeUri, CancellationToken cancellationToken)
  at Dapr.Actors.DaprHttpInteractor.SendAsync(Func`1 requestFunc, String relativeUri, CancellationToken cancellationToken)
  at Dapr.Actors.DaprHttpInteractor.InvokeActorMethodWithRemotingAsync(ActorMessageSerializersManager serializersManager, IActorRequestMessage remotingRequestRequestMessage, CancellationToken cancellationToken)
  at Dapr.Actors.Communication.Client.ActorRemotingClient.InvokeAsync(IActorRequestMessage remotingRequestMessage, CancellationToken cancellationToken)
  at Dapr.Actors.Client.ActorProxy.InvokeMethodAsync(Int32 interfaceId, Int32 methodId, String methodName, IActorRequestMessageBody requestMsgBodyValue, CancellationToken cancellationToken)
  at MyProject.HostedServices.OnStartService.ExecuteAsync(CancellationToken cancellationToken) in C:\Code\MySolution\MyProject\HostedServices\OnStartService.cs:line 32

This only happens on the initial creation of the Docker containers for my app and the Dapr sidecar. Once they are created, I can stop and restart them and everything works fine. However, I can consistently reproduce this exception by deleting the Docker Compose stack in Docker Desktop (just the containers, not the images) and re-running the application.

If I step through the code (slowing things down) or add a delay of at least 2 seconds between creating the actor proxy instance and invoking the method on the actor, it works fine. With a 1 second delay it still fails, but with a 2 second delay it succeeds, and the successful call lands very close to this log entry:

[20:32:04 DBG] Health check processing with combined status Healthy completed after 2.3732ms

My question: is there a way to explicitly confirm dapr is ready and healthy before attempting to invoke a method on an actor proxy instance?

Additional Info

The invocation of the actor is to self-register a reminder. I'm using this approach to create a microservice responsible for platform maintenance operations that need to be executed singularly on a specific schedule. I want the microservice to contain the reminder registration mechanics rather than rely on an external event to trigger reminder registration.

To accomplish this, I've created a background service that performs the registration. Perhaps there's a better way to approach this problem, and I'd love to hear if that's the case, but assuming this is a reasonable approach, this feels like a race condition related to ready checks.

I've attempted to solve this by first waiting for the application to fully start and for a successful health check from the Dapr client. While that resolved the exceptions that earlier iterations of this approach threw consistently, I'm still stuck with the exception above.

Here's the background service implementation with the 2 second delay that will avoid the exception above:

public class OnStartService : BackgroundService
{
    private readonly IActorProxyFactory _actorProxyFactory;
    private readonly DaprClient _daprClient;
    private readonly ILogger _logger;
    private readonly IHostApplicationLifetime _hostApplicationLifetime;

    public OnStartService(IActorProxyFactory actorProxyFactory, DaprClient daprClient, ILogger<OnStartService> logger, IHostApplicationLifetime hostApplicationLifetime)
    {
        _actorProxyFactory = actorProxyFactory;
        _daprClient = daprClient;
        _logger = logger;
        _hostApplicationLifetime = hostApplicationLifetime;
    }

    protected override async Task ExecuteAsync(CancellationToken cancellationToken)
    {
        try
        {
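            // Wait (up to 10 attempts) until the host has started and the sidecar reports healthy.
            var healthCheckCount = 0;
            while ((!_hostApplicationLifetime.ApplicationStarted.IsCancellationRequested || !await _daprClient.CheckHealthAsync(cancellationToken)) && healthCheckCount < 10)
            {
                healthCheckCount++; // without this increment the 10-attempt cap never takes effect
                await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
            }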
            var actorId = new ActorId(nameof(ITerminalMessagesActor.FailTerminalMessagesAsync));
            var actorType = nameof(TerminalMessagesActor);
            var actor = _actorProxyFactory.CreateActorProxy<ITerminalMessagesActor>(actorId, actorType);
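            // Workaround delay: gives the runtime time to finish actor setup before the first call.
            await Task.Delay(TimeSpan.FromSeconds(2), cancellationToken);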
            _logger.LogDebug($"{nameof(OnStartService)} triggering {nameof(ITerminalMessagesActor.RegisterReminderAsync)}");
            await actor.RegisterReminderAsync(cancellationToken);
        }
        catch (Exception exception)
        {
            _logger.LogCritical(exception, "Unable to register reminder(s).");
        }
    }
}

I've also tried using DaprClient.CheckOutboundHealthAsync and DaprClient.WaitForSidecarAsync, both singularly and all together.

Also of note, I am using the default dapr placement and redis containers created by dapr init for my Docker Compose stack by using host.docker.internal as the hostname in the component configs.

Any thoughts on how to address? Is this a failure of my understanding or a race condition in the dapr sdk?

halspang commented 1 year ago

@jsmith-kno2 - The important thing to remember with actor registration is that, on the Dapr runtime side, it is a multi-step process, and actors are one of the last things to get set up for the actual runtime.

The chain of events here is something like:

So you can see here that the issue is two-fold. First, outbound health is strictly for component/API availability. It does not mean that Dapr has fully started. It exists because your app cannot wait for the standard healthz endpoint while starting, as Dapr is waiting for your app to fully start. Outbound is there to solve the case of "my app needs a secret store to start, can I wait for that?"
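To make the distinction concrete, here is a minimal sketch of a bounded polling loop over a health probe. The class and method names (SidecarReadiness, WaitUntilHealthyAsync) are hypothetical and not part of the Dapr SDK. The probe is passed in as a delegate, so it could wrap DaprClient.CheckHealthAsync from the original post. It assumes the loop runs after the app itself has started, for the reason given above.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of the Dapr SDK.
public static class SidecarReadiness
{
    // Calls the probe up to maxAttempts times, waiting `interval` between
    // attempts. Returns true as soon as the probe reports healthy, or false
    // if every attempt reported unhealthy.
    public static async Task<bool> WaitUntilHealthyAsync(
        Func<CancellationToken, Task<bool>> probe,
        int maxAttempts,
        TimeSpan interval,
        CancellationToken cancellationToken)
    {
        for (var attempt = 1; attempt <= maxAttempts; attempt++)
        {
            if (await probe(cancellationToken))
            {
                return true;
            }

            // No need to wait after the final failed attempt.
            if (attempt < maxAttempts)
            {
                await Task.Delay(interval, cancellationToken);
            }
        }

        return false;
    }
}
```

In the background service from the original post, the call might look like `await SidecarReadiness.WaitUntilHealthyAsync(ct => _daprClient.CheckHealthAsync(ct), 10, TimeSpan.FromSeconds(1), cancellationToken)`. A true result only says the sidecar reports healthy. It does not by itself guarantee that the actor's address can already be resolved.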

The second problem is that the placement server is separate from the core runtime. It needs to track all of the actor hosts and handle the balancing of requests. As such, we can't tie it to the sidecar's health.

All of this is to say that if you move to the other health endpoint, you'll certainly be closer to the actual time that your actors have been loaded. However, you'll still want either a wait or a retry inserted in there. If you enable resiliency for actors, you should be able to set that up there, or you can use a .NET library if that's your preference.
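As a rough illustration of the retry option, the sketch below wraps an operation in a bounded retry with a doubling delay. The StartupRetry class and its parameters are hypothetical, written here with plain .NET types only. It is not Dapr's resiliency feature, nor any particular retry library. It retries on any exception for simplicity. A real service would probably narrow the filter, for example to the Dapr.DaprApiException seen in the stack trace.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of the Dapr SDK.
public static class StartupRetry
{
    // Runs the operation until it succeeds or maxAttempts is reached.
    // The delay between attempts starts at initialDelay and doubles each time.
    // The exception from the final attempt is allowed to propagate.
    public static async Task RetryAsync(
        Func<CancellationToken, Task> operation,
        int maxAttempts,
        TimeSpan initialDelay,
        CancellationToken cancellationToken)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        }

        var delay = initialDelay;
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await operation(cancellationToken);
                return;
            }
            // Assumption: every failure is treated as transient. The filter
            // stops retrying on the last attempt or once shutdown is requested.
            catch (Exception) when (attempt < maxAttempts && !cancellationToken.IsCancellationRequested)
            {
                await Task.Delay(delay, cancellationToken);
                delay += delay;
            }
        }
    }
}
```

With this in place, the fixed 2 second delay in the original service could be replaced by something like `await StartupRetry.RetryAsync(ct => actor.RegisterReminderAsync(ct), 5, TimeSpan.FromSeconds(1), cancellationToken)`, so the call succeeds once the actor's address becomes resolvable, without hard-coding how long that takes.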

Please let me know if you have any questions about this explanation.

jsmith-kno2 commented 1 year ago

This is great information. Thank you for the insight and advice!

halspang commented 1 year ago

@jsmith-kno2 - Do you need anything else? Or can we close this issue?