Currently, if a task fails, the reducer stops reading data and writing it to Parquet, so the dumped reducer data is incomplete. We need a way to read the full data once the partition is enabled.
Currently we filter samples by stage id and task id. We need to add a filter by partition size as well. @marin-ma, can we get the record count from the driver before the reducer reads the data? I think so.
@marin-ma figured out that if the input data is the same, the partition id and partition data are reproducible from run to run, so we don't need item 2 (the partition-size filter). The partition id is the "index" column shown in the UI.
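A rough sketch of what filtering by partition id could look like, in plain Python rather than the real code. Everything here is hypothetical: the `Sample` record, its field names, the file paths, and `select_samples` are made up for illustration. The only assumption taken from the discussion is that the partition id (the "index" column) is stable across runs for the same input, while the task id may differ.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Sample:
    """One dumped sample. All field names are hypothetical."""
    stage_id: int
    task_id: int
    partition_id: int  # assumed to match the "index" column in the UI
    path: str


def select_samples(
    samples: Iterable[Sample],
    stage_id: int,
    partition_id: int,
    task_id: Optional[int] = None,
) -> List[Sample]:
    """Keep samples for one stage and partition; task_id is an optional extra filter."""
    return [
        s
        for s in samples
        if s.stage_id == stage_id
        and s.partition_id == partition_id
        and (task_id is None or s.task_id == task_id)
    ]


# Made-up records: two runs of the same stage, where the task id differs
# between runs but the partition id stays the same.
samples = [
    Sample(3, 41, 0, "run1/stage3/part0.parquet"),
    Sample(3, 42, 1, "run1/stage3/part1.parquet"),
    Sample(3, 57, 1, "run2/stage3/part1.parquet"),
    Sample(4, 60, 1, "run1/stage4/part1.parquet"),
]

matches = select_samples(samples, stage_id=3, partition_id=1)
print([s.path for s in matches])
```

Keying on stage id plus partition id picks up the same partition from both runs even though the task ids differ; passing `task_id` narrows it back to a single attempt.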
Description