apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.92k stars 1.12k forks source link

FilterExec should not return Non Statistics when it can not calculate the predicate selectivity #4372

Open mingmwang opened 1 year ago

mingmwang commented 1 year ago

Describe the bug A clear and concise description of what the bug is.

https://github.com/apache/arrow-datafusion/blob/58b43f5c0b629be49a3efa0e37052ec51d9ba3fe/datafusion/core/src/physical_plan/filter.rs#L173-L193

Should return input_stats

To Reproduce Steps to reproduce the behavior:

Expected behavior A clear and concise description of what you expected to happen.

Additional context Add any other context about the problem here.

mingmwang commented 1 year ago

@isidentical

isidentical commented 1 year ago

Without being able to estimate the selectivity, falling back into accepting that the filter always selects feels like would be worse than simply returning not known (which was the default case even before we worked on the filter selectivity analysis). Is there a particular case where propagating input_stats would be useful?

mingmwang commented 1 year ago

I think it is better to propagate it, or we can have a default predicate selectivity(many other DB systems take this approach). Otherwise the Statistics estimation system will be very brittle, the high level operates can not derive the stats.

mingmwang commented 1 year ago

And is there any reason that the total_byte_size is not adjusted accordingly based on the computed predicate selectivity?

isidentical commented 1 year ago

I think it is better to propagate it, or we can have a default predicate selectivity(many other DB systems take this approach).

I'd personally lean towards not propagating due to how unreliable the input is without having any idea about what the filter does to it. For the secondary option (setting a default selectivity), it is a bit better that we are assuming the filter has some sort of an effect and might be a viable alternative (would love to hear the figures/conditions other DB systems use for this, and maybe we can do something similar).

isidentical commented 1 year ago

And is there any reason that the total_byte_size is not adjusted accordingly based on the computed predicate selectivity?

Created #4374!

mingmwang commented 1 year ago

@jackwener Do you know that in Doris, is that a default value for Filter selectivity?

mingmwang commented 1 year ago

@isidentical

Presto/Trino, looks like it is default to 0.9

booleanProperty(
         DEFAULT_FILTER_FACTOR_ENABLED,
         "use a default filter factor for unknown filters in a filter node",
        optimizerConfig.isDefaultFilterFactorEnabled(),
                        false),

https://github.com/trinodb/trino/blob/0f71007ecb480384a9c443ba883f4bc4d896df83/core/trino-main/src/main/java/io/trino/cost/FilterStatsCalculator.java#L90-L93

SparkSQL, looks like it is default to 1.0

def calculateFilterSelectivity(condition: Expression, update: Boolean = true): Option[Double] = {
    condition match {
      case And(cond1, cond2) =>
        val percent1 = calculateFilterSelectivity(cond1, update).getOrElse(1.0)
        val percent2 = calculateFilterSelectivity(cond2, update).getOrElse(1.0)
        Some(percent1 * percent2)

      case Or(cond1, cond2) =>
        val percent1 = calculateFilterSelectivity(cond1, update = false).getOrElse(1.0)
        val percent2 = calculateFilterSelectivity(cond2, update = false).getOrElse(1.0)
        Some(percent1 + percent2 - (percent1 * percent2))

      // Not-operator pushdown
      case Not(And(cond1, cond2)) =>
        calculateFilterSelectivity(Or(Not(cond1), Not(cond2)), update = false)

      // Not-operator pushdown
      case Not(Or(cond1, cond2)) =>
        calculateFilterSelectivity(And(Not(cond1), Not(cond2)), update = false)

      // Collapse two consecutive Not operators which could be generated after Not-operator pushdown
      case Not(Not(cond)) =>
        calculateFilterSelectivity(cond, update = false)

      // The foldable Not has been processed in the ConstantFolding rule
      // This is a top-down traversal. The Not could be pushed down by the above two cases.
      case Not(l @ Literal(null, _)) =>
        calculateSingleCondition(l, update = false).map(boundProbability(_))

      case Not(cond) =>
        calculateFilterSelectivity(cond, update = false) match {
          case Some(percent) => Some(1.0 - percent)
          case None => None
        }

      case _ =>
        calculateSingleCondition(condition, update).map(boundProbability(_))
    }
  }