[Bug]: GenerateSequence is broken on SDF

pabloem commented 1 year ago

What happened?

The transform GenerateSequence.from(0).to(...).withRate(...) that is expanded into a streaming source (that is bounded, as there's a maximum number of records read) is broken after it was migrated from an UnboundedSource based implementaion.

For smaller data sizes, the source works as expected, and a sequence is generated properly.

For larger data sizes, the transform causes workers to OOM and never make progress.

Reasons:

Firstly, the maximum number of initial splits is 100: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/BoundedReadFromUnboundedSource.java#L157-L161
- This makes it such that each individual split can be very large for medium data sizes (over 100k elements per split).
Secondly, the consumption is implemented as a simple DoFn that consumes the whole source without checkpointing halfway: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/BoundedReadFromUnboundedSource.java#L189-L212

This is what causes ooms for workers before they can make any progress.

(here's the expansion for GenerateSequence: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/GenerateSequence.java#L242-L258)

I found that the expansion always became a BoundedReadFromUnboundedSource if we had from and to parameters.

Issue Priority

Priority: 1 (data loss / total loss of function)

Issue Components

[ ] Component: Python SDK
[X] Component: Java SDK
[ ] Component: Go SDK
[ ] Component: Typescript SDK
[X] Component: IO connector
[ ] Component: Beam examples
[ ] Component: Beam playground
[ ] Component: Beam katas
[ ] Component: Website
[ ] Component: Spark Runner
[ ] Component: Flink Runner
[ ] Component: Samza Runner
[ ] Component: Twister2 Runner
[ ] Component: Hazelcast Jet Runner
[X] Component: Google Cloud Dataflow Runner

pabloem commented 1 year ago

@lukecwik could you take a quick look at this and let me know if my assessment is correct?

lukecwik commented 1 year ago

The expectation is that the runner will checkpoint the bundle allowing the SDF to be checkpointed if it detects that the bundle has been running long enough or has done enough meaningful work.

Also the conversion only outputs one record at a time, is there a DoFn buffering all the intermediate output downstream?

lukecwik commented 1 year ago

After looking through the code the issue is that the https://github.com/apache/beam/blob/6ba647333c4c69fb6dfc65929456c7c11570382f/sdks/java/core/src/main/java/org/apache/beam/sdk/io/BoundedReadFromUnboundedSource.java#L184 is a normal DoFn which forces outputting all the records that are part of the split. This doesn't support checkpointing which causes OOMs in Dataflow since it buffers all the output for the bundle.

https://github.com/apache/beam/blob/6ba647333c4c69fb6dfc65929456c7c11570382f/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L455 is an SDF wrapper for unbounded sources which could be useful as part of a solution here.

kennknowles commented 1 year ago

If this has never worked on SDF, is it P1? It does seem like trouble that it cannot be used in portable runners in some cases. How widespread is the problem it causes?

lukecwik commented 1 year ago

It is primarily uses for benchmarks and slowly changing side input patterns.

pabloem commented 1 year ago

this is only for streaming GenerateSequence with a to and a from parameter. And for very large sequences. If it's being used for slowly changing side inputs, it usually doesn't have a to parameter, (I think?) so in that case it is not broken

kennknowles commented 1 year ago

I would say since this has been broken for a long time without too much complaint P2 should be OK. We don't need it included in our alert email of urgent things being neglected.

apache / beam