apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

Microbenchmarks for windowinto #19084

Open kennknowles opened 2 years ago

kennknowles commented 2 years ago

Add microbenchmarks for the WindowInto transform.

R: [~tvalentyn]

Imported from Jira BEAM-4855. Original Jira may contain additional context. Reported by: matthiasml6.

Abacn commented 1 year ago

The performance of WindowInto may be worth investigating: I noticed that Python text IO write performs worse than the Java SDK's, and the slowest DoFn is WindowInto(GlobalWindows()):

Java metrics: http://104.154.241.245/d/bnlHKP3Wz/java-io-it-tests-dataflow?orgId=1&viewPanel=4

Python metrics: http://104.154.241.245/d/gP7vMPqZz/python-io-it-tests-dataflow?orgId=1&viewPanel=5

Java Read ~20s; Java Write ~30s; Python Read ~100s; Python Write 270s

There are two noticeable differences in the job graphs.

The Java write pipeline graph looks like this:

[image: Java write pipeline job graph]

The Python write pipeline graph looks like this:

[image: Python write pipeline job graph]