dapr / dotnet-sdk

Dapr SDK for .NET
Apache License 2.0
1.12k stars 341 forks source link

[Workflow] Workflow hangs when running an activity that can't fulfill DI requirements #1118

Open clintsinger opened 1 year ago

clintsinger commented 1 year ago

Expected Behavior

When a workflow activity is created that has DI dependencies that haven't not yet been registered with the DI system the workflow should throw an exception that it wasn't able to create the activity.

Actual Behavior

The workflow quietly hangs giving no indication why the activity isn't executing.

Steps to Reproduce the Problem

  1. Create Activity that requires a dependency injected into it.
  2. Don't register the service to be injected
  3. Run the workflow
  4. Observe that the workflow hangs with no indication of any issue with the activity.

    Release Note

RELEASE NOTE:

cgillum commented 2 months ago

This is an error case, but probably an easy one to run into. I also worry about the behavior, which is a hang and really hard to debug. I think it's worth treating this as P1.

siri-varma commented 2 months ago

/assign

siri-varma commented 2 months ago

/assign

siri-varma commented 2 months ago

image

I ran the sample app and when I did not register the required dependency I see the following error

The workflow failed - Dapr.Workflow.WorkflowTaskFailedException: Task 'NotifyActivity' (#0) failed with an unhandled exception: Unable to resolve service for type 'System.String' while attempting to activate 'WorkflowConsoleApp.Activities.NotifyActivity'.

Ideally if DI fails the whole app startup should fail. Is that the behavior we want ? @clintsinger @cgillum

clintsinger commented 2 months ago

Which app are you using for testing? I tested my own app with 1.14.4 and don't see the same thing that you do.

For my test I created an activity with a constructor as follows:

public LoggerActivity(
    IClock clock,
    ILogger<LoggerActivity> logger
)
{
    this.clock = clock;
    this.logger = logger;
}

Which is called in the workflow as follows:

await context.CallActivityAsync<object?>(
    name: nameof(LoggerActivity),
    input: new LoggerActivityInput(
        Message: $"Doing some work"),
    options: retryPolicy
);

If I don't register the IClock interface (I'm using one from from NodaTime) with the DI then when the workflow runs, the activity constructor isn't called. I can set a breakpoint in the constructor and it will never break. If I register the dependency such as services.AddSingleton<IClock>(SystemClock.Instance); then the activity is created and the breakpoint will be hit as expected.

I don't see any errors when it fails.

siri-varma commented 2 months ago

I used 1.14.4 as well.

I just ran the example https://github.com/dapr/dotnet-sdk/tree/master/examples/Workflow from the dapr/dotnet-sdk

@clintsinger I think it might be because of the retry policy. Can you please paste your "retry policy" ?

Once the retry policy is exhausted, you should be able to see the error. For ex: in my case, I see the error after 30 seconds, till then I see Waiting for instance to complete, fail or terminate

    WorkflowTaskOptions options = new WorkflowTaskOptions()
    {
        RetryPolicy = new WorkflowRetryPolicy(3, TimeSpan.FromSeconds(10))
    };

image

siri-varma commented 1 month ago

@WhitWaldo , @cgillum, @philliphoff

The problem summary is, if the DI is not done correctly, and there is a retry policy, the error only shows up after the retry policy is exhausted.

I believe if the DI is not done correctly, retry strategy should not come into play at all. Let me know what you folks think.

WhitWaldo commented 1 month ago

@siri-varma

On one hand, it makes sense to me that the retry strategy shouldn't come into effect because DI registration was incomplete on the client. The retry policy itself suggests that a failure happens because of some sort of transient occasion, that if we simply wait a moment and try again, there's a chance it might work. Here, that doesn't follow because a DI registration failure is expected to always fail, so retrying in this case is not likely to work.

But on the other hand, I can think of a scenario where DI resolution might fail for transient reasons - say that I'm pulling a secret when resolving a scoped service and that in one moment, my network connection is briefly interrupted and it's unable to complete the secret request. In that scenario, the next time the workflow or activity is instantiated per the retry policy, we might expect that DI resolution does work.

I'm inclined to suggest that the retry policy should remain as-is because of the example I provided above. I think there's a better opportunity to instead write an analyzer that is able to validate DI registration accuracy and notify the developer at build time rather than change the runtime behavior here.

clintsinger commented 1 month ago

I'd lean more towards what @WhitWaldo is proposing.

I work in Visual Studio and generally when there is a DI issue an exception is thrown and caught in the debugger because the object couldn't be created due to missing dependencies. That's the kind of experience I would normally expect.

If it could be handled at compile time, then that is a better experience in all honesty but may be a challenge because of how DI conditions may be fulfilled at runtime.

siri-varma commented 1 month ago

May be this is a suitable candidate for an integration test.

@clintsinger What do you think about writing an integration test to catch the issue ? The Integration test can run locally or part of the build

@WhitWaldo thoughts ?

WhitWaldo commented 1 month ago

@siri-varma I don't think an integration test necessarily solves the problem as reported. Sure, we might be able to improve the error message here so it's more easily diagnosed as the issue, but if the problem is that a user is DI-injecting a service into an activity that isn't properly registered, ideally that's something their own integration tests would catch.

Just including an E2E test for it ourselves is only going to prove that this happens exactly as advertised as it's not an issue we can readily fix in the SDK since it's more a user-error problem.

This is why I suggest an analyzer that runs when the developer is building this on their own machine (and highlights potential issues). Unfortunately, I can think of any number of scenarios that would make this quite difficult to write a comprehensive analyzer for (I've taken a few personal stabs at it in the past). It's definitely not something I have the time to complete before 1.15, but I think there are a collection of analyzers that could prove useful here. Might be worth making a tracking issue to list them all.