Open jihoonson opened 4 years ago
I believe that for hadoop ingestion, regardless of the value of assumeGrouped, DeterminePartitionsJob.DeterminePartitionsDimSelectionReducer always runs and examines the dimension value distribution to determine the partitions. When assumeGrouped is false, there is an earlier stage to group the rows (DeterminePartitionsGroupByMapper/DeterminePartitionsGroupByReducer).
For native batch ingestion range partitioning, instead of having an earlier stage to group rows, the grouping occurs in the same stage that determines the partitions (i.e., both grouping and partitioning are done with a single pass over the data instead of the two used for hadoop ingestion). Having assumeGrouped as true still benefits native batch ingestion range partitioning since it avoids the time/space overhead of using a bloom filter to group the rows.
In short, the behavior of assumeGrouped is not identical between range partitioning for hadoop and native batch ingestion, but it does have a similar effect of improving ingestion performance when set to true.
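To make the single-pass idea concrete, here is a minimal sketch (hypothetical code, not Druid's actual implementation) of how range-partition boundaries can be cut from a dimension's value distribution in one pass over sorted rows, with a target number of rows per partition:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: derive range-partition boundaries for one dimension
// from already-sorted rows, aiming for roughly targetRows rows per partition.
// Class and method names are illustrative, not from Druid.
public class SinglePassRangePartitions {
    static List<String> partitionBoundaries(List<String> sortedDimValues, int targetRows) {
        List<String> boundaries = new ArrayList<>();
        int rowsInCurrent = 0;
        String prev = null;
        for (String v : sortedDimValues) {
            // Only cut a boundary at a value change, so all rows with the
            // same dimension value land in the same partition.
            if (rowsInCurrent >= targetRows && !v.equals(prev)) {
                boundaries.add(v);
                rowsInCurrent = 0;
            }
            rowsInCurrent++;
            prev = v;
        }
        return boundaries;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("a", "a", "b", "b", "c", "c");
        // Partitions become [start, "b"), ["b", "c"), ["c", end).
        System.out.println(partitionBoundaries(rows, 2)); // prints [b, c]
    }
}
```

The point of the sketch is that determining boundaries only needs the per-value row distribution, which is why grouping and partition determination can share one pass in the native batch case.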
Thanks for the comparison. I'm still not sure how assumeGrouped improves ingestion performance in native batch ingestion. Would you elaborate more on it? What performance does it improve?
When assumeGrouped is true, range partitioning for native batch does not have to hash dimension values for bloom filter inserts/tests. There's also some reduction in memory usage as the bloom filter does not need to be created. Both of these should help PartialDimensionDistributionTask run slightly faster than when assumeGrouped is false.
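The cost difference being described can be sketched as two row filters (hypothetical names, not Druid's classes): a passthrough filter for assumeGrouped=true that does no work per row, and a bloom-filter-backed filter for assumeGrouped=false that pays for hash computations and the bit-set's memory on every row:

```java
import java.util.BitSet;
import java.util.function.Predicate;

// Hypothetical sketch of the two dedup strategies; not Druid code.
public class DedupFilterSketch {
    // assumeGrouped=true: every row passes through, no hashing, no extra memory.
    static Predicate<String> passthroughFilter() {
        return value -> true;
    }

    // assumeGrouped=false: a tiny hand-rolled bloom filter tracks seen values.
    // Each row costs `hashes` hash computations plus the BitSet's footprint.
    static Predicate<String> bloomFilter(int bits, int hashes) {
        BitSet set = new BitSet(bits);
        return value -> {
            boolean maybeSeen = true;
            for (int i = 0; i < hashes; i++) {
                int h = Math.floorMod(value.hashCode() * 31 + i * 0x9E3779B9, bits);
                if (!set.get(h)) {
                    maybeSeen = false; // at least one bit unset: definitely new
                    set.set(h);
                }
            }
            return !maybeSeen; // accept only if (probably) not seen before
        };
    }

    public static void main(String[] args) {
        Predicate<String> bloom = bloomFilter(1 << 16, 3);
        System.out.println(bloom.test("a"));                // true: first sighting
        System.out.println(bloom.test("a"));                // false: duplicate dropped
        System.out.println(passthroughFilter().test("a"));  // true: no work done
    }
}
```

With assumeGrouped=true only the trivial passthrough runs, which is where the modest time/space savings come from.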
Thanks. Do you know roughly how much faster it can be when assumeGrouped = true? I'm curious because assumeGrouped makes hadoop ingestion pretty fast by skipping the first job. Druid users will probably expect a similar performance improvement.
I haven't measured the performance difference for parallel indexing, but I suspect it won't be as dramatic as that for hadoop.
Thanks for all the answers. I think we need to think about what we should do with the assumeGrouped property for native batch ingestion if it doesn't make much difference in performance, since it's mostly about faster ingestion. Maybe we can remove it for native batch. Or we can come up with a better algorithm to make it faster.
Affected Version
0.17.0 and master branches
Description
assumeGrouped is a property supported with the single-dimension based range partitioning. The property was added for Hadoop ingestion first, to accelerate ingestion by skipping the first job that determines partitions when the input data is already partitioned. The single-dimension based range partitioning was added in #8769 for the parallel task, but the parallel task runs in 3 phases no matter what assumeGrouped is. Instead, if it's set, the task uses PassthroughRowDimensionValueFilter, which assumes the input rows are unique. I think this is a bug since the behavior of the property is supposed to be the same in both Hadoop and native parallel ingestion, but I don't think it's a release blocker for 0.17.0.