"assumeGrouped" behaves differently in native batch and hadoop tasks

jihoonson commented 4 years ago

Affected Version

0.17.0 and master branches

Description

assumeGrouped is a property supported by single-dimension range partitioning. It was first added for Hadoop ingestion, where it speeds up ingestion by skipping the first job of partition determination when the input data is already grouped.
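
For reference, a partitionsSpec that sets this property might look roughly like the following, written as a Python dict and serialized to JSON. The dimension name and row target are placeholder values chosen for illustration, not recommendations.

```python
import json

# Illustrative partitionsSpec for single-dimension range partitioning.
# "host" and 5_000_000 are made-up placeholder values.
partitions_spec = {
    "type": "single_dim",
    "partitionDimension": "host",
    "targetRowsPerSegment": 5_000_000,
    "assumeGrouped": True,
}

# Render the spec the way it would appear inside an ingestion spec's tuningConfig.
spec_json = json.dumps(partitions_spec, indent=2)
print(spec_json)
```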

Single-dimension range partitioning was added in #8769 for the parallel task, but the parallel task runs in 3 phases regardless of the value of assumeGrouped. Instead, if it's set, the task uses PassthroughRowDimensionValueFilter, which assumes the input rows are unique.

I think this is a bug since the behavior of the property is supposed to be the same in both Hadoop and native parallel ingestion, but I don't think it's a release blocker for 0.17.0.

ccaominh commented 4 years ago

I believe that for hadoop ingestion, regardless of the value of assumeGrouped, DeterminePartitionsJob.DeterminePartitionsDimSelectionReducer always runs and examines the dimension value distribution to determine the partitions. When assumeGrouped is false, there is an earlier stage that processes the data to group rows (DeterminePartitionsGroupByMapper/DeterminePartitionsGroupByReducer).
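
To make the hadoop flow concrete, here is a minimal sketch of which stages run for a given assumeGrouped value, based only on the description above. The function name is hypothetical and the strings are just labels for the stages; this is not Druid code.

```python
from typing import List


def hadoop_determine_partitions_stages(assume_grouped: bool) -> List[str]:
    """Return the partition-determination stages that run, per the description above."""
    stages = []
    if not assume_grouped:
        # Extra earlier pass that groups rows before the distribution is examined.
        stages.append("DeterminePartitionsGroupByMapper/DeterminePartitionsGroupByReducer")
    # Always runs and examines the dimension value distribution.
    stages.append("DeterminePartitionsDimSelectionReducer")
    return stages
```

With assumeGrouped set to true, only the dimension-selection stage remains, which is where the hadoop speedup comes from.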

For native batch ingestion range partitioning, instead of having an earlier stage to group rows, the grouping occurs in the same stage that determines the partitions (i.e., both grouping and partitioning are done with a single pass over the data instead of the two used for hadoop ingestion). Having assumeGrouped as true still benefits native batch ingestion range partitioning since it avoids the time/space overhead of using a bloom filter to group the rows.
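
A rough sketch of that single pass, assuming a simplified row shape of (time bucket, partition dimension value). The class and function names are hypothetical stand-ins rather than Druid's actual classes, the Bloom filter is a toy one built on hashlib, and a plain Counter stands in for whatever distribution summary the task really collects.

```python
import hashlib
from collections import Counter
from typing import Iterable, Iterator, Tuple

Row = Tuple[int, str]  # (time bucket, partition dimension value); assumed shape


class PassthroughFilter:
    """assumeGrouped = true: every row is taken as already unique, no hashing."""

    def accept(self, row: Row) -> bool:
        return True


class BloomDedupFilter:
    """assumeGrouped = false: drop rows whose key was (probably) seen before."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, row: Row) -> Iterator[int]:
        # One bit position per seeded hash of the row key.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{row[0]}:{row[1]}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def accept(self, row: Row) -> bool:
        positions = list(self._positions(row))
        seen = all(self.bits[p // 8] & (1 << (p % 8)) for p in positions)
        for p in positions:
            self.bits[p // 8] |= 1 << (p % 8)
        # Bloom filters can report false positives, so a unique row may
        # occasionally be dropped; duplicates are never accepted twice.
        return not seen


def dimension_distribution(rows: Iterable[Row], assume_grouped: bool) -> Counter:
    """Group (dedup) and count dimension values in one pass over the rows."""
    row_filter = PassthroughFilter() if assume_grouped else BloomDedupFilter()
    counts: Counter = Counter()
    for row in rows:
        if row_filter.accept(row):
            counts[row[1]] += 1
    return counts
```

The only difference between the two modes is the per-row hashing and the memory held by the bit array, so the saving is in constant factors rather than in the number of passes.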

In short, the behavior of assumeGrouped is not identical between range partitioning for hadoop and native batch ingestion, but it does have a similar effect in improving ingestion performance when set to true.

jihoonson commented 4 years ago

Thanks for the comparison. I'm still not sure how assumeGrouped improves ingestion performance in native batch ingestion. Would you elaborate more on it? What performance does it improve?

ccaominh commented 4 years ago

When assumeGrouped is true, range partitioning for native batch does not have to hash dimension values for bloom filter inserts/tests. There's also some reduction in memory usage as the bloom filter does not need to be created. Both of these should help PartialDimensionDistributionTask run slightly faster than when assumeGrouped is false.

jihoonson commented 4 years ago

Thanks. Do you know roughly how much faster it can be when assumeGrouped = true? I'm curious because assumeGrouped makes hadoop ingestion pretty fast by skipping the first job. Druid users will probably expect a similar performance improvement.

ccaominh commented 4 years ago

I haven't measured the performance difference for parallel indexing, but I suspect it won't be as dramatic as that for hadoop.

jihoonson commented 4 years ago

Thanks for all the answers. I think we need to think about what we should do with the assumeGrouped property for native batch ingestion if it doesn't make much difference in performance since it's mostly about faster ingestion. Maybe we can remove it for native batch. Or we can come up with a better algorithm to make it faster.

apache / druid

"assumeGrouped" behaves differently in native batch and hadoop tasks #9168
