Closed ahmedabu98 closed 2 days ago
I ran two identical pipelines (writing 10B records) before and after this change to measure the difference (this is in a project with large quota):
Creates 1,950 connections and never closes them (they eventually time out).
283 workers
Up to 550 connections at a time. All connections get cleaned up before the pipeline is finished
~250 GiB append rows throughput
Finishes in 21.5 min
Dataflow cost estimate: $6.93
There is a significant reduction (roughly ~70%) in concurrent connections. We can reliably expect the number of concurrent connections per destination to be capped at the number of parallel DoFns (or vCPUs). In other words, concurrent connections <= (num vCPUs) x (num destinations)
R: @reuvenlax R: @Abacn
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control
thanks, LGTM, this is a great improvement!
To be honest I'm not sure why we need the APPEND_CLIENTS cache at all in batch (non default stream) mode. However this does seem to be a simpler fix than removing the cache.
On Fri, Jun 28, 2024 at 7:05 AM Yi Hu @.***> wrote:
thanks, LGTM, this is a great improvement!
— Reply to this email directly, view it on GitHub https://github.com/apache/beam/pull/31710#issuecomment-2197009850, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFAYJVP5L6WDCVT2RKCPPJ3ZJVURJAVCNFSM6AAAAABKBF3IQCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOJXGAYDSOBVGA . You are receiving this because you were mentioned.Message ID: @.***>
@ahmedabu98 can we add this to CHANGES.md given it is a quite important fix?
@ahmedabu98 can we add this to CHANGES.md given it is a quite important fix?
Yup forgot to add it here. Adding it in #31721
Storage API connections in batch are left open and not closed properly. This is because we pin the underlying StreamAppendClient twice: once for the bundle and once for the cache When we are finished with the stream however, we only unpin once for the bundle (and not for the cache).
Batch mode already creates a lot of streams and connections (one stream/connection per destination per bundle) compared to streaming mode. Leaving connections unclosed leads to many concurrent connections and can quickly exhaust the quota.
This change adds a line to invalidate the cached client after we finish using it in a bundle.
Also creates a counter to keep track of active connections.