Excessive new connections to Postgres

dhiaayachi commented 1 month ago

Expected Behavior

Under load, the number of connections to the Postgres database backend remains fairly consistent over time.

Actual Behavior

Under load, many new connections are made to the Postgres database that backs the history service (200+ new connections per second during a load test). The number of connections may rise to handle the load, but it should reach a steady state, with relatively few connections killed and re-established.

One theory is that the get method is called frequently and, for some reason, the refcount is not incremented, so it stays at 0 and a new connection is returned on each call. This needs further investigation to confirm whether the theory is valid.

https://github.com/temporalio/temporal/blob/b383ffffcbbeacdfce2fe021c30f093bab64b5d9/common/persistence/sql/factory.go#L195
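To make the theory concrete, here is a minimal Go sketch of a reference-counted handle. It is not the code in factory.go; the `conn` and `refCounted` types, the `buggy` flag, and the `openedAfter` helper are all hypothetical names made up for illustration. It only shows how a refcount that stays at 0 would cause every get call to create a fresh connection.

```go
package main

import (
	"fmt"
	"sync"
)

// conn stands in for a real database connection handle (hypothetical type).
type conn struct{ id int }

// refCounted is a hypothetical reference-counted handle, not Temporal's code.
type refCounted struct {
	mu       sync.Mutex
	refCount int
	current  *conn
	opened   int  // number of connections created so far
	buggy    bool // when true, get forgets to increment refCount
}

// get returns the shared connection, creating one only when no references are held.
func (r *refCounted) get() *conn {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.refCount == 0 {
		r.opened++
		r.current = &conn{id: r.opened}
	}
	if !r.buggy {
		r.refCount++
	}
	return r.current
}

// release drops one reference and discards the connection when none remain.
func (r *refCounted) release() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.refCount > 0 {
		r.refCount--
		if r.refCount == 0 {
			r.current = nil
		}
	}
}

// openedAfter calls get n times without releasing and reports
// how many connections were created along the way.
func openedAfter(n int, buggy bool) int {
	r := &refCounted{buggy: buggy}
	for i := 0; i < n; i++ {
		r.get()
	}
	return r.opened
}

func main() {
	fmt.Println("refcount tracked:", openedAfter(100, false))
	fmt.Println("refcount stuck at 0:", openedAfter(100, true))
}
```

With the refcount tracked, 100 calls share one connection; with it stuck at 0, the same 100 calls create 100 connections. In this sketch, a get that is always followed immediately by release with no other holder would show the same churn, so that pattern may be worth ruling out as well.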

Steps to Reproduce the Problem

Install the self-hosted platform using Postgres as the DB
Set up monitoring for the number of connections and new connections being made to the DB
Run a load test

Specifications

Version: 1.24
Platform: Kubernetes via Helm charts

dhiaayachi commented 1 month ago

Thanks for reporting this issue! This appears to be related to the "blob size limit" error, which can be caused by large payloads exceeding the 2MB request limit or the 4MB Event History transaction limit set by Temporal.

There are a few things you can try to resolve this:

Reduce Blob Size:
- Compression: Compress your payloads before sending them to Temporal. You can use libraries like gzip or zlib for this.
- Batching: If your data can be processed in smaller chunks, batch it into smaller payloads to avoid exceeding the limit.
Increase Blob Limit:
- Dynamic Config: You can temporarily increase the blob size limits using Temporal's dynamic configuration.
  - For the BlobSizeLimitWarn setting, you can try setting the value to 512 KB (512 * 1024) using the temporal server start-dev command:
```
temporal server start-dev --dynamic-config-value limit.blobSize.warn=524288
```
  - You can also use this approach to temporarily increase the BlobSizeLimitError setting.
Code Optimization:
- Streamlining: Review your code to see if there are any areas where you can reduce the size of the payloads being sent to Temporal. For example, you can optimize your database queries, or eliminate unnecessary data fields.
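
As a rough illustration of the compression suggestion above, here is a minimal Go sketch using the standard library's `compress/gzip`. The `compress` and `decompress` helpers are hypothetical names, and the sketch does not show how the compressed bytes would be wired into a Temporal payload; it only shows the round trip.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// compress gzips data into a new byte slice.
func compress(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	// Close flushes any buffered data and writes the gzip footer.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decompress reverses compress.
func decompress(data []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

func main() {
	// A highly repetitive payload, chosen only so the size reduction is visible.
	payload := bytes.Repeat([]byte("temporal "), 100000)
	packed, err := compress(payload)
	if err != nil {
		panic(err)
	}
	restored, err := decompress(packed)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d -> %d bytes, round trip ok: %v\n",
		len(payload), len(packed), bytes.Equal(payload, restored))
}
```

How much this helps depends on the data; payloads that are already compressed or encrypted may not shrink at all.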

See BlobSizeLimitError in the troubleshooting documentation for more details.

dhiaayachi commented 1 month ago

Thanks for reporting this issue. It appears the issue you are encountering may be related to the refcount not being incremented properly in the get method.

To understand this better, I'd like to ask a few questions:

What is the exact version of Temporal you are using?
Are you using any custom code for connection pooling?
Can you share the details of your monitoring setup?

Once I have this information, I can provide a more precise solution to the issue.
