apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Task]: Improve the performance of Python Synthetic Source #25944

Closed Abacn closed 1 year ago

Abacn commented 1 year ago

What needs to happen?

In a Python IO performance test, generating the synthetic source turned out to be as expensive as writing to the sink (https://github.com/apache/beam/issues/19084#issuecomment-1343373709). This prevents the benchmark from reporting accurate performance data.

I ran a pipeline with the synthetic source only; the cloud profile shows:

[image: cloud profile of the synthetic-source-only pipeline]

This is because Python's built-in random generator uses a Mersenne Twister with a fairly large state ([doc](https://docs.python.org/3/library/random.html)), so assigning a seed is slow. Generating bytes is also slow because it involves many memory allocations. In contrast, Java's built-in random generator (used by the Java SDK's synthetic source) uses a linear congruential generator (LCG) as described by Donald Knuth ([doc](https://docs.oracle.com/javase/8/docs/api/java/util/Random.html)), which is much faster.
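To make the comparison concrete, here is a minimal pure-Python sketch of the LCG that the `java.util.Random` documentation describes (48-bit state, multiplier `0x5DEECE66D`, increment `0xB`). The class name `JavaStyleLCG` and its methods are hypothetical names chosen for illustration; this is not the Beam implementation, and a pure-Python version will not show the speedup of a Cythonized one. It only shows how small the state and the per-step work are.

```python
class JavaStyleLCG:
    """Illustrative LCG using the constants documented for java.util.Random."""

    MULTIPLIER = 0x5DEECE66D
    INCREMENT = 0xB
    MASK = (1 << 48) - 1  # the state is 48 bits wide

    def __init__(self, seed: int) -> None:
        # Seeding is a single XOR and mask, unlike Mersenne Twister's
        # initialization of a 624-word state.
        self.state = (seed ^ self.MULTIPLIER) & self.MASK

    def next_bits(self, bits: int) -> int:
        # One multiply-add per step; return the top `bits` bits (unsigned).
        self.state = (self.state * self.MULTIPLIER + self.INCREMENT) & self.MASK
        return self.state >> (48 - bits)

    def rand_bytes(self, length: int) -> bytes:
        # Take 32 bits per step and emit them low byte first.
        out = bytearray()
        while len(out) < length:
            out += self.next_bits(32).to_bytes(4, "little")
        return bytes(out[:length])
```

With seed 0, the first 32-bit output is 3139482720, which is the unsigned form of the `-1155484576` that `new Random(0).nextInt()` returns in Java.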

I compared the built-in random byte generation against a Cythonized LCG implementation by generating 1M random byte strings of 1024 bytes each. The latter shows a more than 10x performance gain (run time 10 s vs. < 1 s). This doubles the performance of the synthetic pipeline. We should be able to switch to the LCG.
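A rough way to see where the built-in generator spends its time is to time seeding and byte generation separately. The sketch below is not the benchmark used above; the helper name `time_builtin` is hypothetical, and reseeding once per record is an assumption about how a synthetic source might use the generator.

```python
import random
import time


def time_builtin(num_records: int, record_size: int) -> tuple[float, float]:
    """Return (seconds spent seeding, seconds spent generating bytes)."""
    rng = random.Random()
    seed_time = 0.0
    gen_time = 0.0
    for i in range(num_records):
        t0 = time.perf_counter()
        rng.seed(i)  # assumption: one reseed per record
        t1 = time.perf_counter()
        # Draw record_size random bytes from the Mersenne Twister.
        rng.getrandbits(8 * record_size).to_bytes(record_size, "little")
        t2 = time.perf_counter()
        seed_time += t1 - t0
        gen_time += t2 - t1
    return seed_time, gen_time
```

Calling `time_builtin(1_000_000, 1024)` mirrors the record count and size quoted above; absolute numbers will vary by machine and Python version.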

Once this is done, the Python synthetic pipeline will have only the minimum cost of generating the bytes themselves and can then be used to benchmark the performance of SDF.

Issue Priority

Priority: 3 (nice-to-have improvement)

Issue Components

lostluck commented 1 year ago

This is also true for the Go SDK's load tests. We'd largely be best off dictating the random source generation algorithm, for consistency across the synthetic sources used as performance measures.

Switching to a cheaper RNG approach with lower-quality randomness is among the many changes to improve the Load Test metrics for the Go SDK: https://github.com/apache/beam/pull/17698/files