dhiaayachi / temporal

Temporal service
https://docs.temporal.io
MIT License

Too high memory usage on matching service #250

Open dhiaayachi opened 2 weeks ago

dhiaayachi commented 2 weeks ago

When I tested with more than 200 workers connected, I encountered some issues: memory usage on the matching service was too high, my server memory was full, and I subsequently discovered that disk I/O was also maxed out. I deployed Temporal using Helm with a standalone MySQL database. My Kubernetes node has 2 CPU cores and 2 GB of memory. I want to know how I should configure my cluster if there are tens of thousands of workers connected.

dhiaayachi commented 3 days ago

Thanks for reporting this!

It seems like you are encountering performance issues with your Temporal cluster when you have a large number of workers connected. This is a common problem when you are dealing with a high volume of tasks, and it is important to correctly configure your cluster to handle this type of load.

Let's break down the issues you're facing and explore some solutions:

It seems like you have some understanding of how to configure your cluster using Helm and a standalone MySQL instance. However, I need a few more details to give you specific recommendations:

  1. What version of Temporal Server are you using? The configuration options can vary between versions.
  2. What are the current resource limits for your Temporal pods in Kubernetes? This will help me to understand how much you are already provisioning for each component.
  3. What is your expected workload? Knowing the number of tasks and the frequency at which they are generated will allow me to suggest appropriate scaling configurations.

Once I have this information, I can help you find the best configuration for your cluster to handle tens of thousands of workers.

In the meantime, I recommend reviewing the Temporal documentation at https://docs.temporal.io, particularly the self-hosted deployment and worker performance guides.

I hope this information is helpful. Please feel free to ask any further questions.

dhiaayachi commented 3 days ago

Thank you for reporting this issue. It appears that your setup is experiencing resource constraints due to a large number of workers and limited node resources.

Based on the provided information, the memory, disk I/O, and CPU limitations of your current Kubernetes node are a major bottleneck when scaling to tens of thousands of workers.

Given those constraints, the first recommendation is to provision substantially more CPU, memory, and disk I/O headroom for the Temporal services, the matching service in particular, and for the database. For specific guidance on tuning worker configurations and optimizing cluster setup, refer to the Temporal documentation at https://docs.temporal.io.

Remember to monitor your cluster resources and metrics closely to understand the performance impact of your configuration changes. Feel free to reach out to Temporal support if you have further questions or require additional assistance in troubleshooting your setup.

dhiaayachi commented 3 days ago

Thank you for reporting this issue.

Based on your description, it appears that you are experiencing performance issues when running Temporal with a large number of workers.

Let's go through your setup and see what we can adjust to improve things:

  1. Resource constraints: You mention a Kubernetes node with 2 CPU cores and 2 GB of RAM. This is likely insufficient for a Temporal cluster handling tens of thousands of workers. For a large-scale deployment, consider significantly increasing your node resources, especially RAM.
  2. MySQL: While MySQL is a popular database, Cassandra is generally better suited to Temporal's workload at this scale due to its distributed nature and high-performance characteristics.
  3. Worker Sessions: To manage the number of workers connected, consider exploring the Worker Sessions feature. It simplifies Task Routing by ensuring related Activities execute on the same Worker, which can reduce contention and improve performance (see the Worker Sessions documentation; a minimal sketch follows this list).
  4. Visibility: With a large number of workers, a separate Elasticsearch Visibility store might be beneficial, as Elasticsearch is highly scalable and optimized for search (see the self-hosted Visibility feature setup documentation).
  5. Task Queue Metrics: Monitoring Task Queue metrics can help you understand the workload. Ensure your Task Queues have enough partitions for the anticipated task rate; you can use the Temporal CLI to get Task Queue details (see the Task Queue backlog metrics documentation, and the second sketch below).
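
Regarding item 3, here is a minimal Go SDK sketch of the Worker Sessions pattern. The task queue name and the Download/Process activities are hypothetical placeholders; only `EnableSessionWorker` and the session API come from the SDK:

```go
package main

import (
	"context"
	"log"
	"time"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
	"go.temporal.io/sdk/workflow"
)

// Hypothetical activities that share local state (e.g., a file on disk),
// which is why they must run on the same Worker.
func Download(ctx context.Context, url string) (string, error) {
	return "/tmp/data", nil // placeholder: pretend we downloaded something
}

func Process(ctx context.Context, path string) error {
	return nil // placeholder: process the downloaded file
}

func SessionWorkflow(ctx workflow.Context, url string) error {
	// Create a session so both activities are routed to the same Worker.
	sessionCtx, err := workflow.CreateSession(ctx, &workflow.SessionOptions{
		CreationTimeout:  time.Minute,
		ExecutionTimeout: 10 * time.Minute,
	})
	if err != nil {
		return err
	}
	defer workflow.CompleteSession(sessionCtx)

	sessionCtx = workflow.WithActivityOptions(sessionCtx, workflow.ActivityOptions{
		StartToCloseTimeout: 5 * time.Minute,
	})

	var path string
	if err := workflow.ExecuteActivity(sessionCtx, Download, url).Get(sessionCtx, &path); err != nil {
		return err
	}
	return workflow.ExecuteActivity(sessionCtx, Process, path).Get(sessionCtx, nil)
}

func main() {
	c, err := client.Dial(client.Options{}) // defaults to localhost:7233
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// EnableSessionWorker is required for workflows that create sessions.
	w := worker.New(c, "session-task-queue", worker.Options{EnableSessionWorker: true})
	w.RegisterWorkflow(SessionWorkflow)
	w.RegisterActivity(Download)
	w.RegisterActivity(Process)
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker failed:", err)
	}
}
```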

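And for item 5, a small sketch that inspects a Task Queue programmatically with the Go SDK client. The frontend address and task queue name are assumptions to adapt to your environment:

```go
package main

import (
	"context"
	"fmt"
	"log"

	enumspb "go.temporal.io/api/enums/v1"
	"go.temporal.io/sdk/client"
)

func main() {
	// Address is an assumption: point this at your Helm release's
	// frontend service (often <release>-frontend:7233).
	c, err := client.Dial(client.Options{HostPort: "temporal-frontend:7233"})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// "my-task-queue" is a placeholder name.
	resp, err := c.DescribeTaskQueue(context.Background(),
		"my-task-queue", enumspb.TASK_QUEUE_TYPE_ACTIVITY)
	if err != nil {
		log.Fatalln("describe failed:", err)
	}
	for _, p := range resp.GetPollers() {
		fmt.Printf("poller %s last seen %v\n", p.GetIdentity(), p.GetLastAccessTime())
	}
}
```

The Temporal CLI exposes the same information (for example, `temporal task-queue describe`), which is often more convenient for ad-hoc checks.
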
Keep in mind that tuning Temporal for optimal performance requires a combination of these strategies. It's crucial to understand your workload patterns and iterate on your configuration to find the right balance for your environment.
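
As a concrete starting point for that iteration, here is a hedged sketch of the Go SDK's worker-side concurrency and poller knobs; the numbers are illustrative, not recommendations:

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	c, err := client.Dial(client.Options{}) // defaults to localhost:7233
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// With hundreds of workers, each worker's poller count multiplies into
	// load on the matching service; keeping these bounded helps.
	w := worker.New(c, "my-task-queue", worker.Options{
		MaxConcurrentActivityExecutionSize:     50,  // concurrent activities per worker
		MaxConcurrentWorkflowTaskExecutionSize: 50,  // concurrent workflow tasks per worker
		MaxConcurrentActivityTaskPollers:       4,   // pollers hitting the matching service
		MaxConcurrentWorkflowTaskPollers:       4,
		WorkerActivitiesPerSecond:              100, // per-worker activity rate limit
	})
	_ = w // register workflows/activities and call w.Run(...) as usual
}
```

Lowering poller counts reduces the per-worker load on the matching service, which matters when hundreds of workers poll the same Task Queues.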

Please let me know if you have any further questions or need additional assistance.