Open dhiaayachi opened 2 months ago
Thanks for reporting this issue. We've seen similar issues with MariaDB in clustered mode (using Galera).
There are a few potential causes for this:
Network Issues: Galera relies on a reliable network. If there are intermittent network issues between your MariaDB nodes, it can lead to delays in data replication and impact the Workflow's ability to get the latest data.
MariaDB Version Compatibility: While MariaDB 10.6.9 is supported, there might be specific configuration or compatibility nuances you need to be aware of. Please ensure you have followed any specific guidance related to using MariaDB in clustered mode with Temporal.
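Database Connection Pooling: Make sure the SQL connection pool in your Temporal Server persistence configuration (for example maxConns, maxIdleConns, and maxConnLifetime) is sized appropriately. Workers talk to the Temporal Server rather than to the database, so an undersized server-side pool can lead to timeouts when the server connects to the clustered MariaDB.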
Workflow Task Timeout: If your workflows are configured with a short WorkflowTaskTimeout (the default is 10 seconds), workflow tasks may be timing out before the worker finishes processing them. Consider increasing the WorkflowTaskTimeout to allow more time for the worker to process the workflow task.
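To narrow down the first cause above (a Galera node that has fallen behind or dropped out of the cluster), the node's wsrep status variables can be inspected. The following is a minimal sketch, not an official check: it assumes you have already collected the output of SHOW GLOBAL STATUS LIKE 'wsrep_%' from one node into a plain dict. The helper name galera_node_problems and the 0.1 flow-control threshold are assumptions chosen for illustration, and the sample values are made up.

```python
# Expected values for a healthy, fully synced Galera node.
EXPECTED = {
    "wsrep_cluster_status": "Primary",        # node is part of the primary component
    "wsrep_connected": "ON",                  # node is connected to the cluster
    "wsrep_ready": "ON",                      # node accepts queries
    "wsrep_local_state_comment": "Synced",    # node has caught up with the cluster
}


def galera_node_problems(status, max_flow_control_paused=0.1):
    """Return a list of human-readable problems found in a wsrep status dict.

    `status` maps status variable names to their string values.
    `max_flow_control_paused` is an illustrative threshold (an assumption),
    not a value recommended by Galera or Temporal.
    """
    problems = []
    for name, expected in EXPECTED.items():
        actual = status.get(name)
        if actual != expected:
            problems.append(f"{name} is {actual!r}, expected {expected!r}")

    # wsrep_flow_control_paused is the fraction of time replication was paused.
    paused = float(status.get("wsrep_flow_control_paused", 0.0))
    if paused > max_flow_control_paused:
        problems.append(
            f"wsrep_flow_control_paused is {paused}, above {max_flow_control_paused}"
        )
    return problems


# Illustrative, made-up status values for a node that is acting as a donor.
sample = {
    "wsrep_cluster_status": "Primary",
    "wsrep_connected": "ON",
    "wsrep_ready": "ON",
    "wsrep_local_state_comment": "Donor/Desynced",
    "wsrep_flow_control_paused": "0.5",
}
for problem in galera_node_problems(sample):
    print(problem)
```

Running the same check against every node makes it easier to tell whether stale reads come from one lagging node or from the cluster as a whole.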
Steps to Troubleshoot:
Review the Galera documentation: Ensure your Galera cluster is configured and operating as expected. Look for any warnings or errors related to data replication or network connectivity.
Check your Temporal configuration: Verify that your Temporal Server is correctly configured to connect to your MariaDB cluster. Pay attention to the connection pooling settings and to the WorkflowTaskTimeout set on your workflows.
Investigate MariaDB logs: Look for errors or warnings in your MariaDB logs that might indicate problems with data replication, connection issues, or performance bottlenecks.
Monitor Temporal Service metrics and workflow histories: Compare when WorkflowTaskStarted and WorkflowTaskCompleted events appear in the history of affected workflows to identify any patterns of failures or delays.
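One way to look for such patterns is to pair each WorkflowTaskStarted with the event that closes it and flag tasks that ran long or never closed. The sketch below assumes the history has already been reduced to a list of flat dicts with the keys event_id, event_type, event_time, and started_event_id. That shape is a simplification invented for this example, not the format produced by the Temporal CLI or SDKs, and the function name is hypothetical. The 10-second default mirrors the default WorkflowTaskTimeout mentioned above.

```python
from datetime import datetime, timedelta

# Event types that close a started workflow task.
CLOSING_TYPES = {"WorkflowTaskCompleted", "WorkflowTaskTimedOut", "WorkflowTaskFailed"}


def slow_or_open_workflow_tasks(events, threshold=timedelta(seconds=10)):
    """Return (started_event_id, duration) pairs for suspicious workflow tasks.

    A task is reported if it took longer than `threshold` to close, or if it
    was started and never closed (duration is None in that case).
    """
    started = {}   # started event id -> start time
    findings = []
    for event in events:
        if event["event_type"] == "WorkflowTaskStarted":
            started[event["event_id"]] = event["event_time"]
        elif event["event_type"] in CLOSING_TYPES:
            begin = started.pop(event["started_event_id"], None)
            if begin is not None:
                took = event["event_time"] - begin
                if took > threshold:
                    findings.append((event["started_event_id"], took))
    # Anything still in `started` never reached a closing event.
    for event_id in started:
        findings.append((event_id, None))
    return findings


# Made-up history: one fast task, one that timed out, one still open.
t0 = datetime(2023, 1, 1, 12, 0, 0)
history = [
    {"event_id": 3, "event_type": "WorkflowTaskStarted", "event_time": t0},
    {"event_id": 4, "event_type": "WorkflowTaskCompleted",
     "event_time": t0 + timedelta(seconds=1), "started_event_id": 3},
    {"event_id": 9, "event_type": "WorkflowTaskStarted",
     "event_time": t0 + timedelta(seconds=5)},
    {"event_id": 10, "event_type": "WorkflowTaskTimedOut",
     "event_time": t0 + timedelta(seconds=20), "started_event_id": 9},
    {"event_id": 12, "event_type": "WorkflowTaskStarted",
     "event_time": t0 + timedelta(seconds=30)},
]
for event_id, took in slow_or_open_workflow_tasks(history):
    print(event_id, took)
```

A cluster of slow or open tasks that lines up with Galera warnings in the MariaDB logs would point toward the database rather than the worker.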
If the issue persists, please provide more details about your setup:
Let me know if you have any further questions.
Thank you for reporting the issue!
We are aware of an issue with Temporal 1.20 and MariaDB 10.6.9 in clustered mode (Galera) that can cause workflows to become stuck in "Running" status.
Here are a couple of things you can try:
Please also share the following details to help us further troubleshoot the issue:
We are working on a fix for this issue and will provide an update as soon as it is available.
Expected Behavior
We were using Temporal 1.20 with MariaDB 10.2.30 and have now moved to a new MariaDB 10.6.9 running in clustered mode (using Galera, https://galeracluster.com/).
Actual Behavior
When testing with the new version, we observe issues with some workflows: the workflow stays in status "Running" and an activity is scheduled, but nothing happens.
Steps to Reproduce the Problem
Specifications