Open marin-ma opened 2 days ago
Does this PR only work for the middle stage, or does it also dump data that comes from shuffle?
@Yohahaha Yes. It only works for the middle stage. Updated the PR title.
@marin-ma Curious what the improvements are here compared to https://github.com/apache/incubator-gluten/pull/725? Thanks.
@zhztheplayer Currently, the input files are dumped during the middle stage of execution. This PR fetches all the data and saves it to a file before the pipeline starts. The stage input is then read from this dumped file. This ensures a complete input file is saved even if the task fails. That way, we have the full input data of the failed task and can reproduce the failure with the microbenchmark.
Run Gluten Clickhouse CI on x86
Collect all input data and save it into a Parquet file. Then, read the data from the Parquet file to feed it into the pipeline.
Update spark.gluten.sql.benchmark_task.partitionId and spark.gluten.sql.benchmark_task.taskId to accept a comma-separated string of multiple partition ids/task ids.
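To illustrate what a comma-separated value for these two settings could look like, here is a minimal Python sketch. It is not the Gluten implementation: the helper names `parse_ids` and `should_dump`, the example values, and the rule that a task is selected when either its partition id or its task id matches are all assumptions made for illustration. Only the two configuration keys come from this PR.

```python
from typing import Set

def parse_ids(value: str) -> Set[int]:
    """Parse a comma-separated id string such as "0,2,5" into a set of ints.

    Empty entries are skipped, so an empty string yields an empty set.
    """
    return {int(part) for part in value.split(",") if part.strip()}

# Hypothetical settings: the key names come from the PR description,
# the values are made up for this example.
conf = {
    "spark.gluten.sql.benchmark_task.partitionId": "0,2,5",
    "spark.gluten.sql.benchmark_task.taskId": "",
}

partition_ids = parse_ids(conf["spark.gluten.sql.benchmark_task.partitionId"])
task_ids = parse_ids(conf["spark.gluten.sql.benchmark_task.taskId"])

def should_dump(partition_id: int, task_id: int) -> bool:
    """Assumed matching rule: select a task if either id is listed."""
    return partition_id in partition_ids or task_id in task_ids
```

With the values above, partitions 0, 2 and 5 would be selected and no task id filter would apply.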