dhiaayachi / temporal

Temporal service
https://docs.temporal.io
MIT License

Frontend Service - goroutine (CPU & Memory) Leak #103

Open dhiaayachi opened 2 weeks ago

dhiaayachi commented 2 weeks ago

Expected Behavior

There should be no memory leak resulting from objects not being properly garbage collected.

Actual Behavior

The number of objects on the heap keeps growing. This seems to result in a slow increase in CPU and memory usage, eventually resulting in an outage.

Steps to Reproduce the Problem

temporal server start-dev --port 7233 --ui-port 8233 --metrics-port 9233


- Flame graph

![flame_graph](https://github.com/user-attachments/assets/44044c3b-2fae-4160-8b87-28f2e6b80407)

- This was observed across local, shared, and production environments. Please see the Prometheus chart from a production environment, where the num_goroutine count kept increasing until a restart. Notice that the leak appears to be isolated to the frontend service; the other services seem fine. The CPU and memory usage charts followed the same pattern.

![image](https://github.com/user-attachments/assets/034b8b5e-96b2-4879-85ac-18def393131e)
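One way to put a number on "kept increasing until a restart" is to fit a line through periodic readings of the gauge and look at its slope. The following is a minimal sketch, not part of the original report: the `sample` type, the `slope` helper, and the readings in `main` are all hypothetical and only illustrate the idea.

```go
package main

import "fmt"

// sample is one observation of a gauge such as a goroutine count.
// Both the type and its fields are hypothetical names for this sketch.
type sample struct {
	seconds float64 // time elapsed since the first observation
	value   float64 // gauge value observed at that time
}

// slope returns the least-squares growth rate of the samples in
// units per second. It returns 0 when there are fewer than two
// samples or when all timestamps are identical.
func slope(samples []sample) float64 {
	n := float64(len(samples))
	if n < 2 {
		return 0
	}
	var sumT, sumV, sumTV, sumTT float64
	for _, s := range samples {
		sumT += s.seconds
		sumV += s.value
		sumTV += s.seconds * s.value
		sumTT += s.seconds * s.seconds
	}
	den := n*sumTT - sumT*sumT
	if den == 0 {
		return 0
	}
	return (n*sumTV - sumT*sumV) / den
}

func main() {
	// Made-up readings taken one minute apart; not data from this issue.
	readings := []sample{{0, 100}, {60, 130}, {120, 160}, {180, 190}}
	fmt.Printf("growth: %.2f goroutines/s\n", slope(readings))
}
```

A slope that stays positive over a long idle window would be consistent with a leak, whereas a slope near zero would point to ordinary fluctuation.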

## Specifications
This was observed across multiple versions
  - Server Version: 1.22.5, 1.22.7, 1.23.1, etc.
  - Platform: Linux
  - MTLS enabled
  - Auth disabled

## Links
This issue is potentially related to https://community.temporal.io/t/high-cpu-usage-memory-leakage-on-frontend-service/4246/1

dhiaayachi commented 1 day ago

Temporal Server Memory Leak Issue

This GitHub issue describes a potential memory leak in the Temporal Server, causing a gradual increase in CPU and memory usage, ultimately leading to outages.

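Expected Behavior:

There should be no memory leak caused by objects that are not garbage collected.

Actual Behavior:

The number of objects on the heap keeps growing, which appears to cause a slow increase in CPU and memory usage and eventually an outage.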

Steps to Reproduce:

  1. Start the Temporal Server in dev mode:
    temporal server start-dev --port 7233 --ui-port 8233 --metrics-port 9233
  2. Avoid starting any workflows or making any gRPC calls using the SDKs or the Web UI.
  3. Monitor pprof and metrics periodically. This shows a perpetual increase in goroutine counts, objects on the heap (memory allocations for objects), and total memory allocations.
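For step 3, the goroutine gauge can be read out of a Prometheus text-format payload with a few lines of Go. This is a minimal sketch under assumptions: the payload would come from the metrics port given in step 1, but the endpoint path, the metric name `num_goroutines`, and the `service_name` label shown here are not confirmed by this issue, and `sumMetric` is a hypothetical helper. Fetching the payload over HTTP is left out.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sumMetric adds up every sample of the named metric in a Prometheus
// text-format payload, ignoring labels. The returned bool is false
// when no sample with that name was found.
func sumMetric(payload, name string) (float64, bool) {
	var total float64
	found := false
	sc := bufio.NewScanner(strings.NewReader(payload))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		// The metric name ends at '{' (start of labels) or at a space.
		end := strings.IndexAny(line, "{ ")
		if end < 0 || line[:end] != name {
			continue
		}
		// The value follows the closing '}' of the labels, if any.
		rest := line[end:]
		if i := strings.LastIndex(rest, "}"); i >= 0 {
			rest = rest[i+1:]
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		total += v
		found = true
	}
	return total, found
}

func main() {
	// Hypothetical payload; metric and label names are assumptions.
	payload := "# TYPE num_goroutines gauge\n" +
		"num_goroutines{service_name=\"frontend\"} 250\n" +
		"num_goroutines{service_name=\"history\"} 120\n" +
		"other_metric 7\n"
	total, ok := sumMetric(payload, "num_goroutines")
	fmt.Println(total, ok)
}
```

Recording this value at a fixed interval while the server is idle gives the series to compare against the pprof snapshots.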

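Observations:

  - The behavior was observed in local, shared, and production environments.
  - In production, the num_goroutine count kept increasing until a restart, and the leak appears to be isolated to the frontend service.
  - CPU and memory usage followed the same pattern.

Specifications:

  - Server Version: 1.22.5, 1.22.7, 1.23.1, etc.
  - Platform: Linux
  - MTLS enabled
  - Auth disabled

Potential Related Issues:

  - https://community.temporal.io/t/high-cpu-usage-memory-leakage-on-frontend-service/4246/1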

Recommendations:

By following these steps, you can hopefully pinpoint the root cause of the memory leak and take corrective actions to ensure a stable and reliable Temporal Service.

dhiaayachi commented 22 hours ago

Thanks for reporting this issue. Could you please tell me what version of Temporal you are using? Is it Temporal Cloud or self-hosted? Also, could you provide the details of your database setup? This will help us diagnose the issue further.

dhiaayachi commented 6 hours ago

Thank you for reporting this issue.

Could you please confirm if the Frontend service was started with the --ui-port flag?

This flag is necessary to enable the UI, which provides insights into the system.

Also, please let me know whether you've attempted any troubleshooting steps or tried running the Temporal Server in a different environment.

The Troubleshooting section in our documentation might have some helpful information.