Open mingmwang opened 1 year ago
@isidentical
Without being able to estimate the selectivity, falling back to assuming that the filter always selects everything feels like it would be worse than simply returning not known (which was the default case even before we worked on the filter selectivity analysis). Is there a particular case where propagating `input_stats` would be useful?
I think it is better to propagate it, or we could have a default predicate selectivity (many other DB systems take this approach). Otherwise the statistics estimation system will be very brittle: the higher-level operators cannot derive the stats.
And is there any reason that the `total_byte_size` is not adjusted accordingly based on the computed predicate selectivity?
I think it is better to propagate it, or we could have a default predicate selectivity (many other DB systems take this approach).
I'd personally lean towards not propagating, given how unreliable the input is when we have no idea what the filter does to it. The secondary option (setting a default selectivity) is a bit better, since at least it assumes the filter has some effect, and it might be a viable alternative (I'd love to hear the figures/conditions other DB systems use for this; maybe we can do something similar).
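To illustrate that secondary option, here is a rough sketch of how a default selectivity factor could scale the estimates. The `Stats` struct and `apply_default_selectivity` function are hypothetical simplifications for illustration only, not DataFusion's actual `Statistics` API:

```rust
// Hypothetical, simplified statistics struct; DataFusion's real
// `Statistics` type has more fields and different semantics.
#[derive(Debug, Clone, PartialEq)]
struct Stats {
    num_rows: Option<usize>,
    total_byte_size: Option<usize>,
}

/// Apply a default selectivity factor (e.g. 0.9, as Presto/Trino
/// reportedly uses) to both the row count and the byte size.
fn apply_default_selectivity(input: &Stats, selectivity: f64) -> Stats {
    Stats {
        // `round` avoids surprises from inexact f64 multiplication.
        num_rows: input
            .num_rows
            .map(|n| (n as f64 * selectivity).round() as usize),
        total_byte_size: input
            .total_byte_size
            .map(|b| (b as f64 * selectivity).round() as usize),
    }
}

fn main() {
    let input = Stats { num_rows: Some(1000), total_byte_size: Some(8000) };
    // With a 0.9 default factor: 900 rows, 7200 bytes.
    println!("{:?}", apply_default_selectivity(&input, 0.9));
}
```

Note that both `num_rows` and `total_byte_size` are scaled here, which also addresses the second question in this thread.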
And is there any reason that the `total_byte_size` is not adjusted accordingly based on the computed predicate selectivity?
Created #4374!
@jackwener Do you know whether Doris has a default value for filter selectivity?
@isidentical
In Presto/Trino, it looks like the default is 0.9:
```java
booleanProperty(
        DEFAULT_FILTER_FACTOR_ENABLED,
        "use a default filter factor for unknown filters in a filter node",
        optimizerConfig.isDefaultFilterFactorEnabled(),
        false),
```
In Spark SQL, it looks like the default is 1.0:
```scala
def calculateFilterSelectivity(condition: Expression, update: Boolean = true): Option[Double] = {
  condition match {
    case And(cond1, cond2) =>
      val percent1 = calculateFilterSelectivity(cond1, update).getOrElse(1.0)
      val percent2 = calculateFilterSelectivity(cond2, update).getOrElse(1.0)
      Some(percent1 * percent2)

    case Or(cond1, cond2) =>
      val percent1 = calculateFilterSelectivity(cond1, update = false).getOrElse(1.0)
      val percent2 = calculateFilterSelectivity(cond2, update = false).getOrElse(1.0)
      Some(percent1 + percent2 - (percent1 * percent2))

    // Not-operator pushdown
    case Not(And(cond1, cond2)) =>
      calculateFilterSelectivity(Or(Not(cond1), Not(cond2)), update = false)

    // Not-operator pushdown
    case Not(Or(cond1, cond2)) =>
      calculateFilterSelectivity(And(Not(cond1), Not(cond2)), update = false)

    // Collapse two consecutive Not operators which could be generated after Not-operator pushdown
    case Not(Not(cond)) =>
      calculateFilterSelectivity(cond, update = false)

    // The foldable Not has been processed in the ConstantFolding rule
    // This is a top-down traversal. The Not could be pushed down by the above two cases.
    case Not(l @ Literal(null, _)) =>
      calculateSingleCondition(l, update = false).map(boundProbability(_))

    case Not(cond) =>
      calculateFilterSelectivity(cond, update = false) match {
        case Some(percent) => Some(1.0 - percent)
        case None => None
      }

    case _ =>
      calculateSingleCondition(condition, update).map(boundProbability(_))
  }
}
```
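The combination rules in the Spark snippet are worth spelling out: `And` multiplies the two selectivities (assuming the predicates are independent), `Or` uses inclusion-exclusion, and `Not` takes the complement. Restated in Rust for illustration:

```rust
// Selectivity combination rules from the Spark code above,
// under an independence assumption between predicates.
fn and_sel(p1: f64, p2: f64) -> f64 { p1 * p2 }
fn or_sel(p1: f64, p2: f64) -> f64 { p1 + p2 - p1 * p2 }
fn not_sel(p: f64) -> f64 { 1.0 - p }

fn main() {
    // Two predicates, each estimated to pass 50% of rows:
    assert_eq!(and_sel(0.5, 0.5), 0.25);
    assert_eq!(or_sel(0.5, 0.5), 0.75);
    // De Morgan: NOT(a AND b) == NOT(a) OR NOT(b), which is exactly
    // what the Not-operator pushdown cases in the Scala code rely on.
    assert_eq!(not_sel(and_sel(0.5, 0.5)), or_sel(not_sel(0.5), not_sel(0.5)));
    println!("ok");
}
```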
**Describe the bug**
https://github.com/apache/arrow-datafusion/blob/58b43f5c0b629be49a3efa0e37052ec51d9ba3fe/datafusion/core/src/physical_plan/filter.rs#L173-L193
Should return `input_stats`.
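A minimal sketch of the behavior being proposed, with hypothetical simplified types (the real code lives in `datafusion/core/src/physical_plan/filter.rs` and its `Statistics` type differs): when the predicate selectivity cannot be estimated, propagate the input statistics as an inexact upper bound instead of discarding them.

```rust
// Hypothetical, simplified stand-in for DataFusion's `Statistics`.
#[derive(Debug, Clone, PartialEq)]
struct Stats {
    num_rows: Option<usize>,
    is_exact: bool,
}

fn filter_statistics(input: Stats, selectivity: Option<f64>) -> Stats {
    match selectivity {
        // Known selectivity: scale the row estimate.
        Some(s) => Stats {
            num_rows: input.num_rows.map(|n| (n as f64 * s).round() as usize),
            is_exact: false,
        },
        // Unknown selectivity: keep the input stats as an upper bound
        // rather than returning unknown, but mark them inexact.
        None => Stats { is_exact: false, ..input },
    }
}

fn main() {
    let input = Stats { num_rows: Some(100), is_exact: true };
    println!("{:?}", filter_statistics(input, None));
}
```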