[Bug]: [Go SDK] Memory seems to be leaking on 2.49.0 with Dataflow

boolangery commented 1 year ago

What happened?

Hi,

We updated a Pubsub streaming job on Dataflow from 2.46.0 to 2.49.0. See these memory diagrams:

2.46.0 memory utilisation:

2.49.0 memory utilisation:

We rolled this job back to 2.46.0 because workers were running out of memory and a lot of lag was introduced.

Do you have an explanation? What changed in memory management between 2.46.0 and 2.49.0?

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

[ ] Component: Python SDK
[ ] Component: Java SDK
[X] Component: Go SDK
[ ] Component: Typescript SDK
[ ] Component: IO connector
[ ] Component: Beam examples
[ ] Component: Beam playground
[ ] Component: Beam katas
[ ] Component: Website
[ ] Component: Spark Runner
[ ] Component: Flink Runner
[ ] Component: Samza Runner
[ ] Component: Twister2 Runner
[ ] Component: Hazelcast Jet Runner
[X] Component: Google Cloud Dataflow Runner

Abacn commented 1 year ago

Hi, could you please raise a customer issue with Dataflow? The job ID and job graph are needed for triaging.

The symptom reported here is general, and it is hard to find the cause without the job info available.

Abacn commented 1 year ago

One thing you could at least check is whether 2.47 and 2.48 show the same symptom, which would narrow down the issue.

boolangery commented 1 year ago

> Hi, could you please raise a customer issue with Dataflow? The job ID and job graph are needed for triaging.
>
> The symptom reported here is general, and it is hard to find the cause without the job info available.

Sure, where can I submit this? I can't find anything in GCP. Do you have a link? Thanks.

scwhittle commented 1 year ago

This issue appears to occur in 2.48 as well with a pipeline just consuming from Cloud Pubsub.

    _ = pipeline | "Read pubsub" >> io.ReadFromPubSub(
        subscription=sub, with_attributes=True
    )

liferoad commented 1 year ago

> > Hi, could you please raise a customer issue with Dataflow? The job ID and job graph are needed for triaging. The symptom reported here is general, and it is hard to find the cause without the job info available.
>
> Sure, where can I submit this? I can't find anything in GCP. Do you have a link? Thanks.

Please check this: https://cloud.google.com/dataflow/docs/support/getting-support#file-bugs-or-feature-requests

tvalentyn commented 1 year ago

@boolangery to confirm, was this a Go or Python pipeline?

chleech commented 1 year ago

I’ve been experiencing the same issue. To validate, we also set up a pipeline that only reads from a single subscription and ran it for 2 weeks; its memory usage increased constantly.

We got a response from the Dataflow team, and their suggestion was to try 2.46.0. I will update here once we manage to test.

boolangery commented 1 year ago

> @boolangery to confirm, was this a Go or Python pipeline?

A Go one

boolangery commented 1 year ago

Issue has been created: https://issuetracker.google.com/issues/297918533

lostluck commented 1 year ago

Adding the following service option when starting the job will let you collect and share CPU and heap profiles of the SDK worker in Dataflow:

--dataflow_service_options=enable_google_cloud_profiler

From https://cloud.google.com/dataflow/docs/guides/profiling-a-pipeline#enable_for_pipelines

tvalentyn commented 1 year ago

FYI, we have observed a memory leak in Python SDK, which we correlated with a protobuf dependency upgrade: https://github.com/apache/beam/issues/28246. This issue may or may not be similar in nature.

kennknowles commented 1 year ago

If this makes the Go SDK unusable in 2.49.0 and beyond then per https://beam.apache.org/contribute/issue-priorities/ I would agree with P1. If it is usable in some cases then P2 is appropriate.

kennknowles commented 1 year ago

And if P1 it should not be unassigned and should have ~daily updates and block releases.

boolangery commented 9 months ago

This issue is still present in 2.53.