apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Task]: Improve python statecache #28802

Open AnandInguva opened 1 year ago

AnandInguva commented 1 year ago

What needs to happen?

Initially, for the Python SDK, we will raise the state cache size from 0 MB (disabled) to 100 MB. After that, there are some improvements that could be made to the state cache. For example,

The Java implementation of the cache is in: https://github.com/apache/beam/blob/master/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/Caches.java

Most of the caching complexity is within: https://github.com/apache/beam/blob/master/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/state/StateFetchingIterators.java

The views over these caches perform view-level operations (e.g. merging the old view of the data with in-memory updates).

Generally, understanding the code in https://github.com/apache/beam/tree/master/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/state should provide most answers.
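To make the idea of a memory-bounded cache concrete, here is a minimal sketch of a size-bounded LRU cache in Python. It is not Beam's state cache and not a port of `Caches.java`; the class name `SizeBoundedLRUCache`, the `sizer` callback, and the eviction policy are all assumptions chosen for illustration.

```python
from collections import OrderedDict
import sys


class SizeBoundedLRUCache:
    """Hypothetical size-bounded LRU cache; NOT Beam's actual implementation."""

    def __init__(self, max_bytes, sizer=sys.getsizeof):
        self._max_bytes = max_bytes
        self._sizer = sizer  # assumed callback that estimates an entry's size
        self._entries = OrderedDict()  # key -> (value, size)
        self._current_bytes = 0

    def get(self, key, loader):
        """Return the cached value, calling loader() only on a miss."""
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key][0]
        value = loader()
        self.put(key, value)
        return value

    def put(self, key, value):
        size = self._sizer(value)
        if key in self._entries:
            # Replace an existing entry: release its bytes first.
            self._current_bytes -= self._entries.pop(key)[1]
        if size > self._max_bytes:
            return  # too large to cache; with max_bytes=0 nothing is kept
        self._entries[key] = (value, size)
        self._current_bytes += size
        # Evict least recently used entries until back under budget.
        while self._current_bytes > self._max_bytes:
            _, (_, evicted_size) = self._entries.popitem(last=False)
            self._current_bytes -= evicted_size

    @property
    def current_bytes(self):
        return self._current_bytes

    def __contains__(self, key):
        return key in self._entries
```

In this sketch a budget of 0 bytes behaves like a disabled cache, which mirrors the "0 MB" setting only by analogy. A real implementation would also need thread safety and a more accurate size estimate than `sys.getsizeof`.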

Issue Priority

Priority: 3 (nice-to-have improvement)

Issue Components

AnandInguva commented 1 year ago

cc: @tvalentyn

AnandInguva commented 1 year ago

AsIter view_fn

An iterable might be consumed one element at a time, so this could mean more reads from the side input cache in the GCS bucket?

AsList view_fn

A list is materialized, so we wouldn't need as many reads from the side input cache in the GCS bucket?
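The difference being asked about can be sketched with a toy model. This does not use Beam or GCS at all; `CountingPagedSource`, its page size, and the fetch counter are hypothetical stand-ins for a remote side input that is read in chunks.

```python
class CountingPagedSource:
    """Hypothetical remote source that serves pages and counts fetches."""

    def __init__(self, elements, page_size):
        self._elements = list(elements)
        self._page_size = page_size
        self.fetches = 0  # number of simulated remote reads

    def fetch_page(self, start):
        self.fetches += 1
        return self._elements[start:start + self._page_size]

    def __iter__(self):
        # Lazy view: pull one page at a time, only as far as the consumer reads.
        start = 0
        while True:
            page = self.fetch_page(start)
            if not page:
                return
            yield from page
            start += len(page)


# Iterable-style access: every full pass goes back to the source.
lazy = CountingPagedSource(range(10), page_size=2)
for _ in range(2):
    total = sum(lazy)
lazy_fetches = lazy.fetches

# List-style access: read everything once, then reuse the in-memory copy.
eager = CountingPagedSource(range(10), page_size=2)
materialized = list(eager)
for _ in range(2):
    total = sum(materialized)
eager_fetches = eager.fetches
```

Under these assumptions, repeated passes over the lazy view repeat the remote reads unless something caches the pages, while the materialized list pays the read cost once but holds every element in memory. Whether Beam's side input views behave this way in a given runner is exactly the open question in this comment.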

For AsIter with state_cache_size=100 MB,

tvalentyn commented 9 months ago

State cache was enabled in https://github.com/apache/beam/issues/28770