dhiaayachi / temporal

Temporal service
https://docs.temporal.io
MIT License

Too high memory usage on matching service #250

Open dhiaayachi opened 2 weeks ago

dhiaayachi commented 2 weeks ago

When I tested with more than 200 workers connected, I encountered some issues: memory usage on the matching service was too high, my server memory was full, and I subsequently discovered that disk I/O was also maxed out. I deployed Temporal using Helm with a standalone MySQL database. My Kubernetes node has 2 CPU cores and 2 GB of memory. I want to know how I should configure my cluster if there are tens of thousands of workers connected.

dhiaayachi commented 3 days ago

Thanks for reporting this!

It seems like you are encountering performance issues with your Temporal cluster when you have a large number of workers connected. This is a common problem when you are dealing with a high volume of tasks, and it is important to correctly configure your cluster to handle this type of load.

Let's break down the issues you're facing and explore some solutions:

It seems like you have some understanding of how to configure your cluster using Helm and a standalone MySQL instance. However, I need a few more details to give you specific recommendations:

  1. What version of Temporal Server are you using? The configuration options can vary between versions.
  2. What are the current resource limits for your Temporal pods in Kubernetes? This will help me to understand how much you are already provisioning for each component.
  3. What is your expected workload? Knowing the number of tasks and the frequency at which they are generated will allow me to suggest appropriate scaling configurations.

Once I have this information, I can help you find the best configuration for your cluster to handle tens of thousands of workers.

In the meantime, I recommend reviewing the Temporal documentation at https://docs.temporal.io, particularly the self-hosted deployment and worker performance guides.

I hope this information is helpful. Please feel free to ask any further questions.

dhiaayachi commented 3 days ago

Thank you for reporting this issue. It appears that your setup is experiencing resource constraints due to a large number of workers and limited node resources.

Based on the provided information, the memory, disk I/O, and CPU limitations of your current Kubernetes node are a major bottleneck when scaling to tens of thousands of workers.

Given those constraints, the first recommendation is to provision substantially more CPU, memory, and disk I/O headroom for the Temporal services, the matching service in particular, and for the database. For specific guidance on tuning worker configurations and optimizing cluster setup, refer to the Temporal documentation at https://docs.temporal.io.

Remember to monitor your cluster resources and metrics closely to understand the performance impact of your configuration changes. Feel free to reach out to Temporal support if you have further questions or require additional assistance in troubleshooting your setup.

dhiaayachi commented 3 days ago

Thank you for reporting this issue.

Based on your description, it appears that you are experiencing performance issues when running Temporal with a large number of workers.

Let's go through your setup and see what we can adjust to improve things:

  1. Resource constraints: You mention a Kubernetes node with 2 CPU cores and 2 GB of RAM. This is likely insufficient for a Temporal cluster handling tens of thousands of workers. For a large-scale deployment, consider significantly increasing your node resources, especially RAM.
  2. MySQL: While MySQL is a popular database, Cassandra is generally better suited to Temporal's workload at this scale due to its distributed nature and high-performance characteristics.
  3. Worker Sessions: To manage the number of workers connected, consider exploring the Worker Sessions feature. It simplifies Task Routing by ensuring related Activities execute on the same Worker, which can reduce contention and improve performance (see the Worker Sessions documentation; a minimal sketch follows this list).
  4. Visibility: With a large number of workers, a separate Elasticsearch Visibility store might be beneficial, as Elasticsearch is highly scalable and optimized for search (see the self-hosted Visibility feature setup documentation).
  5. Task Queue Metrics: Monitoring Task Queue metrics can help you understand the workload. Ensure your Task Queues have enough partitions for the anticipated task rate; you can use the Temporal CLI to get Task Queue details (see the Task Queue backlog metrics documentation, and the second sketch below).
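
Regarding item 3, here is a minimal Go SDK sketch of the Worker Sessions pattern. The task queue name and the Download/Process activities are hypothetical placeholders; only `EnableSessionWorker` and the session API come from the SDK:

```go
package main

import (
	"context"
	"log"
	"time"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
	"go.temporal.io/sdk/workflow"
)

// Hypothetical activities that share local state (e.g., a file on disk),
// which is why they must run on the same Worker.
func Download(ctx context.Context, url string) (string, error) {
	return "/tmp/data", nil // placeholder: pretend we downloaded something
}

func Process(ctx context.Context, path string) error {
	return nil // placeholder: process the downloaded file
}

func SessionWorkflow(ctx workflow.Context, url string) error {
	// Create a session so both activities are routed to the same Worker.
	sessionCtx, err := workflow.CreateSession(ctx, &workflow.SessionOptions{
		CreationTimeout:  time.Minute,
		ExecutionTimeout: 10 * time.Minute,
	})
	if err != nil {
		return err
	}
	defer workflow.CompleteSession(sessionCtx)

	sessionCtx = workflow.WithActivityOptions(sessionCtx, workflow.ActivityOptions{
		StartToCloseTimeout: 5 * time.Minute,
	})

	var path string
	if err := workflow.ExecuteActivity(sessionCtx, Download, url).Get(sessionCtx, &path); err != nil {
		return err
	}
	return workflow.ExecuteActivity(sessionCtx, Process, path).Get(sessionCtx, nil)
}

func main() {
	c, err := client.Dial(client.Options{}) // defaults to localhost:7233
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// EnableSessionWorker is required for workflows that create sessions.
	w := worker.New(c, "session-task-queue", worker.Options{EnableSessionWorker: true})
	w.RegisterWorkflow(SessionWorkflow)
	w.RegisterActivity(Download)
	w.RegisterActivity(Process)
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker failed:", err)
	}
}
```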

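And for item 5, a small sketch that inspects a Task Queue programmatically with the Go SDK client. The frontend address and task queue name are assumptions to adapt to your environment:

```go
package main

import (
	"context"
	"fmt"
	"log"

	enumspb "go.temporal.io/api/enums/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	// Address is an assumption: point this at your Helm release's
	// frontend service (often <release>-frontend:7233).
	c, err := client.Dial(client.Options{HostPort: "temporal-frontend:7233"})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// "my-task-queue" is a placeholder name.
	resp, err := c.DescribeTaskQueue(context.Background(),
		"my-task-queue", enumspb.TASK_QUEUE_TYPE_ACTIVITY)
	if err != nil {
		log.Fatalln("describe failed:", err)
	}
	for _, p := range resp.GetPollers() {
		fmt.Printf("poller %s last seen %v\n", p.GetIdentity(), p.GetLastAccessTime())
	}
}
```

The Temporal CLI exposes the same information (for example, `temporal task-queue describe`), which is often more convenient for ad-hoc checks.
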
Keep in mind that tuning Temporal for optimal performance requires a combination of these strategies. It's crucial to understand your workload patterns and iterate on your configuration to find the right balance for your environment.
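
As a concrete starting point for that iteration, here is a hedged sketch of the Go SDK's worker-side concurrency and poller knobs; the numbers are illustrative, not recommendations:

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	c, err := client.Dial(client.Options{}) // defaults to localhost:7233
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// With hundreds of workers, each worker's poller count multiplies into
	// load on the matching service; keeping these bounded helps.
	w := worker.New(c, "my-task-queue", worker.Options{
		MaxConcurrentActivityExecutionSize:     50,  // concurrent activities per worker
		MaxConcurrentWorkflowTaskExecutionSize: 50,  // concurrent workflow tasks per worker
		MaxConcurrentActivityTaskPollers:       4,   // pollers hitting the matching service
		MaxConcurrentWorkflowTaskPollers:       4,
		WorkerActivitiesPerSecond:              100, // per-worker activity rate limit
	})
	_ = w // register workflows/activities and call w.Run(...) as usual
}
```

Lowering poller counts reduces the per-worker load on the matching service, which matters when hundreds of workers poll the same Task Queues.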

Please let me know if you have any further questions or need additional assistance.