apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Task]: Optimize Spark Runner parDo transform evaluator #32537

Open twosom opened 6 days ago

twosom commented 6 days ago

What needs to happen?

When the Apache Spark Runner's TransformTranslator evaluates a ParDo, it applies too many filter operations. The filters exist because a ParDo can have multiple outputs: one filter per TupleTag keeps only the elements that belong to that tag.

However, the filter operation is also applied to a ParDo with a single output, which can have a performance impact. Therefore, we should avoid applying the filter operation when evaluating ParDo operations with a single output.
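The idea can be illustrated with a minimal sketch in plain Java. This is not the Spark Runner's actual code: the class `ParDoOutputSplitSketch`, the method `splitByTag`, and the use of strings as tags are all hypothetical, and an in-memory list stands in for the runner's collection of (tag, element) pairs. It also assumes that a single-output ParDo only ever emits elements under its one tag.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration only; names and types are not taken from the Spark Runner.
final class ParDoOutputSplitSketch {

  // Splits (tag, element) pairs into one list per output tag.
  static Map<String, List<String>> splitByTag(
      List<Map.Entry<String, String>> tagged, List<String> outputTags) {
    Map<String, List<String>> outputs = new LinkedHashMap<>();

    if (outputTags.size() == 1) {
      // Single output: assume every element carries the only tag,
      // so the per-tag filter pass is skipped and values are just unwrapped.
      outputs.put(
          outputTags.get(0),
          tagged.stream().map(Map.Entry::getValue).collect(Collectors.toList()));
      return outputs;
    }

    // Multiple outputs: one filter pass per tag keeps only that tag's elements.
    for (String tag : outputTags) {
      outputs.put(
          tag,
          tagged.stream()
              .filter(e -> e.getKey().equals(tag))
              .map(Map.Entry::getValue)
              .collect(Collectors.toList()));
    }
    return outputs;
  }

  public static void main(String[] args) {
    List<Map.Entry<String, String>> tagged =
        List.of(Map.entry("main", "a"), Map.entry("side", "b"), Map.entry("main", "c"));
    System.out.println(splitByTag(tagged, List.of("main", "side")));
  }
}
```

With N output tags the multi-output branch scans the input N times, while the single-output branch does one unwrapping pass and no tag comparison, which is the saving the issue is asking for.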

related mail context

Issue Priority

Priority: 2 (default / most normal work should be filed as P2)

Issue Components

twosom commented 6 days ago

.take-issue

tejasrok007 commented 6 days ago

Can you please elaborate on what is needed here? I understand that we can't use filter operations because they can have a performance impact, but it seems we would have to change the evaluator completely to satisfy this requirement. Can you guide me a little on this? I think I can complete it.

twosom commented 6 days ago

> Can you please elaborate on what is needed here? I understand that we can't use filter operations because they can have a performance impact, but it seems we would have to change the evaluator completely to satisfy this requirement. Can you guide me a little on this? I think I can complete it.

@tejasrok007 Thanks for your comment. But I've already done the work and am testing it.