apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0
7.67k stars 4.19k forks source link

Properly close Storage API batch connections #31710

Closed ahmedabu98 closed 2 days ago

ahmedabu98 commented 3 days ago

Storage API connections in batch are left open and not closed properly. This is because we pin the underlying StreamAppendClient twice: once for the bundle and once for the cache When we are finished with the stream however, we only unpin once for the bundle (and not for the cache).

Batch mode already creates a lot of streams and connections (one stream/connection per destination per bundle) compared to streaming mode. Leaving connections unclosed leads to many concurrent connections and can quickly exhaust the quota.

This change adds a line to invalidate the cached client after we finish using it in a bundle.

Also creates a counter to keep track of active connections.

ahmedabu98 commented 3 days ago

I ran two identical pipelines (writing 10B records) before and after this change to measure the difference (this is in a project with large quota):

Job before fix (2024-06-27_22_27_46-11409968544737429803)

Job after fix (2024-06-27_22_53_21-1016146269936299010):

Side by side comparison (the jobs were run sequentially):

image

There is a significant reduction (roughly ~70%) in concurrent connections. We can reliably expect the number of concurrent connections per destination to be capped at the number of parallel DoFns (or vCPUs). In other words, concurrent connections <= (num vCPUs) x (num destinations)


Note that AppendRows throughput stayed roughly the same:

image

ahmedabu98 commented 3 days ago

R: @reuvenlax R: @Abacn

github-actions[bot] commented 3 days ago

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

Abacn commented 3 days ago

thanks, LGTM, this is a great improvement!

reuvenlax commented 2 days ago

To be honest I'm not sure why we need the APPEND_CLIENTS cache at all in batch (non default stream) mode. However this does seem to be a simpler fix than removing the cache.

On Fri, Jun 28, 2024 at 7:05 AM Yi Hu @.***> wrote:

thanks, LGTM, this is a great improvement!

— Reply to this email directly, view it on GitHub https://github.com/apache/beam/pull/31710#issuecomment-2197009850, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFAYJVP5L6WDCVT2RKCPPJ3ZJVURJAVCNFSM6AAAAABKBF3IQCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOJXGAYDSOBVGA . You are receiving this because you were mentioned.Message ID: @.***>

liferoad commented 2 days ago

@ahmedabu98 can we add this to CHANGES.md given it is a quite important fix?

ahmedabu98 commented 2 days ago

@ahmedabu98 can we add this to CHANGES.md given it is a quite important fix?

Yup forgot to add it here. Adding it in #31721