[GLUTEN-7953][VL] Fetch and dump all inputs for micro benchmark on middle stage begin

apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.

https://gluten.apache.org/

Apache License 2.0


[GLUTEN-7953][VL] Fetch and dump all inputs for micro benchmark on middle stage begin #7998

Open marin-ma opened 2 days ago

marin-ma commented 2 days ago

Collect all input data and save it into a Parquet file. Then, read the data from the Parquet file to feed it into the pipeline.

Update spark.gluten.sql.benchmark_task.partitionId and spark.gluten.sql.benchmark_task.taskId to accept a comma-separated string of multiple partition IDs or task IDs.
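As a rough illustration of the comma-separated format, here is a minimal Python sketch of how such a value could be split into individual IDs. The helper name `parse_id_list` and the example values are hypothetical; they are not taken from the Gluten code base, which handles these settings in its own Scala/C++ code.

```python
from __future__ import annotations


def parse_id_list(value: str) -> set[int]:
    """Parse a comma-separated string such as "1,3,5" into a set of ints.

    Hypothetical helper for illustration only. Empty entries are skipped,
    and surrounding whitespace is tolerated because int() ignores it.
    """
    return {int(part) for part in value.split(",") if part.strip()}


# Made-up example values, shaped like what might be supplied for
# spark.gluten.sql.benchmark_task.partitionId and
# spark.gluten.sql.benchmark_task.taskId.
partition_ids = parse_id_list("0, 2,5")
task_ids = parse_id_list("17")
```

A set is used here on the assumption that the order of the IDs does not matter and that duplicates should collapse; a list would work equally well if order needs to be preserved.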

github-actions[bot] commented 2 days ago

https://github.com/apache/incubator-gluten/issues/7953

Yohahaha commented 2 days ago

Does this PR only work for the middle stage? Or does it dump the data that comes from shuffle?

marin-ma commented 2 days ago

@Yohahaha Yes. It only works for the middle stage. Updated the PR title.

zhztheplayer commented 2 days ago

@marin-ma Curious what the improvements are here compared to https://github.com/apache/incubator-gluten/pull/725? Thanks.

marin-ma commented 2 days ago

> @marin-ma Curious what're the improvements here, comparing to #725? Thanks.

@zhztheplayer Currently, the input files are dumped while the middle stage is executing. This PR fetches all the input data and saves it to a file before the pipeline starts. The stage input is then read from this dumped file. This ensures that a complete input file is saved even if the task fails. That way, we have the full input data of the failed task and can reproduce the failure with the micro benchmark.
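The dump-then-replay idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a JSON Lines file as a stand-in for the Parquet file, and every name in it (`dump_inputs`, `read_inputs`, `run_stage`, `failing_process`) is hypothetical rather than part of Gluten.

```python
import json
import os
import tempfile
from typing import Callable, Iterable, Iterator


def dump_inputs(batches: Iterable[dict], path: str) -> None:
    """Drain the whole input iterator into a file before any processing starts."""
    with open(path, "w", encoding="utf-8") as f:
        for batch in batches:
            f.write(json.dumps(batch) + "\n")


def read_inputs(path: str) -> Iterator[dict]:
    """Replay the dumped input from the file, one batch per line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


def run_stage(batches: Iterable[dict],
              process: Callable[[dict], None],
              path: str) -> None:
    """Dump all input first, then feed the pipeline from the dump."""
    dump_inputs(batches, path)       # the complete dump exists before the pipeline runs
    for batch in read_inputs(path):  # the pipeline only ever sees the dumped data
        process(batch)


def failing_process(batch: dict) -> None:
    """Simulated pipeline step that fails partway through the input."""
    if batch["id"] == 1:
        raise RuntimeError("simulated task failure")


dump_path = os.path.join(tempfile.mkdtemp(), "stage_input.jsonl")
source = ({"id": i} for i in range(3))  # stand-in for the stage's input iterator

try:
    run_stage(source, failing_process, dump_path)
    failed = False
except RuntimeError:
    failed = True

# Even though processing failed on the second batch, the dump is complete.
replayed = list(read_inputs(dump_path))
```

Because the input is fully written before `process` is ever called, the file still holds all three batches after the simulated failure, which is the property that makes the failure reproducible offline.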

github-actions[bot] commented 1 day ago

Run Gluten Clickhouse CI on x86

github-actions[bot] commented 20 hours ago

Run Gluten Clickhouse CI on x86