elsa-workflows / elsa-core

A .NET workflows library
https://v3.elsaworkflows.io/
MIT License
6.51k stars 1.2k forks source link

[BUG] Tenant ID Not Propagating Correctly in Multi-Tenancy Setup When Triggering Workflows #6129

Open sfmskywalker opened 2 days ago

sfmskywalker commented 2 days ago

When setting up Elsa.Server.Web with MultiTenancy support, workflows are successfully created and stored in the database with the correct Tenant ID populated. The API and UI correctly pass the Tenant ID during workflow creation. However, an issue arises when manually triggering a workflow from the UI.

The problem appears to occur within the ProtoActor runtime. Specifically, the Tenant ID is not propagated correctly during the call to CreateAndRun. As a result, the Tenant ID is NULL in the SQL command executed by Entity Framework, which leads to an error indicating that the workflow definition cannot be found.

Analysis suggests that a new IServiceScope may be created during execution by IServiceScopeFactory, causing the TenantAccessor to reset and lose the Tenant ID context. Without the Tenant ID, the database query cannot locate the appropriate workflow definition for the tenant.

This issue is observed after integrating the new MultiTenancy packages (currently in preview). Previously, the same workflow execution process functioned as expected.

Error Details

The specific error encountered is as follows:

System.Exception: One or more errors occurred. (Workflow DefinitionVersionId: <ID from database> not found.)
    at Elsa.Workflows.Runtime.ProtoActor.ProtoBuf.WorkflowInstanceClient.CreateAndRun(CreateAndRunWorkflowInstanceRequest request, IDictionary`2 headers, CancellationToken ct)

Steps to Reproduce

  1. Configure Elsa.Server.Web with MultiTenancy support.
  2. Set up the API and UI to pass the Tenant ID correctly.
  3. Create a workflow through the UI, ensuring the Tenant ID is populated in the database.
  4. Attempt to manually trigger the workflow from the UI.

Expected Behavior

The Tenant ID should propagate through the Actor message headers to ensure the correct tenant's workflow instance is located and executed.

Actual Behavior

The Tenant ID is not propagated correctly, resulting in the following:

Possible Cause

The issue may stem from a new IServiceScope being created during the CreateAndRun call, resetting the TenantAccessor and losing the Tenant ID context. Consequently, the Tenant ID becomes NULL when the database query is executed.

Suggested Next Steps

  1. Investigate whether the IServiceScopeFactory or any internal scope creation during workflow execution is improperly resetting the TenantAccessor.
  2. Verify that the Tenant ID is correctly included in the Actor message headers.
  3. Ensure that the Tenant ID is preserved across all service scopes during the workflow execution flow.
  4. Debug the MultiTenancy packages locally to trace where the Tenant ID context might be lost.
CODEdire commented 2 days ago

I was able to debug down to the creation of the scope, but it now looks like the ITenantAccessor wasn't initialized before the scope creation and therefore does not have a Tenant set before or after the scope creation. This may be further upstream than just the scope creation. Could the actor be running in a different service scope from the message received event?

CODEdire commented 2 days ago

I think the issue is with the DefaultTenantContextInitializer and how AsyncLocal works on the TenantAccessor. I ran a quick test taking out the AsyncLocal from the TenantAccessor just to see if it worked and messages are now containing the Tenant ID. I think this might be an issue with how AsyncLocal works against nested async methods and since it sets it under a nested async, this is causing it to lose "scope" after InitializeAsync and so it gets reset to NULL.

CODEdire commented 2 days ago

I think the issue is with the DefaultTenantContextInitializer and how AsyncLocal works on the TenantAccessor. I ran a quick test taking out the AsyncLocal from the TenantAccessor just to see if it worked and messages are now containing the Tenant ID. I think this might be an issue with how AsyncLocal works against nested async methods and since it sets it under a nested async, this is causing it to lose "scope" after InitializeAsync and so it gets reset to NULL.

Following this up with an additional test. I took the lines that were in the DefaultTenantContextInitializer out of that class and used the ITenantFinder and ITenantAccessor directly in the receive middleware. Results show that that this is working correctly with the Tenant ID being passed around. This leads me to conclude that the InitializeAsync is having an issue with the AsyncLocal since it would naturally want to reset the variable once outside of the InitializeAsync.