apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

`Statistics::is_exact` semantics #5613

Open crepererum opened 1 year ago

crepererum commented 1 year ago

Describe the bug

It is unclear what `Statistics::is_exact = false` means. The docs are here:

https://github.com/apache/arrow-datafusion/blob/a578150e63e344fbaa7d13eda58544482dea4729/datafusion/common/src/stats.rs#L34-L37

These state for this case:

may contain an inexact estimate and may not be the actual value

What does "inexact" mean? Some potential definitions (we only consider Some(...) fields here!):

- overestimate: The statistics bound the actual values, e.g. num_rows is at least the actual row count and the reported min-max range is at least as wide as the actual range.
- both: The statistics are only a rough guide.

I think there is a pretty important difference between "overestimate" and "both", because the former allows you to prune execution branches or entire operations (e.g. sorts in some cases), while the latter can only be used to re-order operations (e.g. joins) or to select a concrete operation from a pool (e.g. the type of join).
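A minimal sketch of why that distinction matters, using hypothetical helpers (not DataFusion APIs): a true overestimate can prove a branch empty and let the planner drop it, whereas a "both"-style rough guess is only good enough for relative, cost-based choices.

// Hypothetical helpers, assuming `rows_upper_bound` is a true overestimate
// (actual row count <= rows_upper_bound) and the `*_rows_estimate` values are rough guesses.

/// Sound only for an overestimate: if the upper bound is 0, the branch is provably empty.
fn can_drop_branch(rows_upper_bound: usize) -> bool {
    rows_upper_bound == 0
}

/// A "both"-style estimate is still fine for relative decisions,
/// e.g. picking the smaller side of a hash join as the build side.
fn pick_build_side(left_rows_estimate: usize, right_rows_estimate: usize) -> &'static str {
    if left_rows_estimate <= right_rows_estimate {
        "left"
    } else {
        "right"
    }
}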

Side note: Due to predicate pushdown it is pretty unlikely that there will be exact statistics for any realistic data source.

Expected behavior

Clarify the behavior.

Additional context

Cross-ref #997.

alamb commented 1 year ago

both: The statistics are only a rough guide.

I think this is the best that we can get, for the reason you cite.

Side note: Due to predicate pushdown it is pretty unlikely that there will be exact statistics for any realistic data source.

Specifically, I think "inexact" means "best effort" but cannot be relied on.

cc @Dandandan @isidentical @metesynnada

crepererum commented 1 year ago

@alamb I think "overestimate" would be possible to achieve in many cases and quite helpful.

alamb commented 1 year ago

🤔 I see -- I can imagine the ranges being an overestimate (being at least as large as the actual range).

I wonder how an "overestimate" would apply to num_rows. Unless we knew the distribution exactly, in order to preserve an overestimate of num_rows, wouldn't we have to assume no rows were filtered?

crepererum commented 1 year ago

I wonder how an "overestimate" would apply to num_rows. Unless we knew the distribution exactly, in order to preserve an overestimate of num_rows, wouldn't we have to assume no rows were filtered?

I guess if the ranges (min/max) are overestimated / too wide, then the number of rows is likely an overestimate as well (upper bound).
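A purely illustrative sketch of that reasoning, assuming a predicate `col > literal` and a min/max range that may be too wide but is guaranteed to cover the actual values (names are hypothetical, not DataFusion APIs):

// Propagate an overestimated row count through the filter `col > literal`,
// given only an upper bound on the column's maximum.
fn filtered_rows_upper_bound(
    num_rows_upper_bound: usize, // overestimate of the input row count
    col_max_upper_bound: i64,    // upper bound on the column's actual maximum
    literal: i64,
) -> usize {
    if col_max_upper_bound <= literal {
        // Even the widest possible range lies entirely at or below the
        // literal, so no row can satisfy `col > literal`.
        0
    } else {
        // Without distribution information, the only safe overestimate is to
        // assume that no rows were filtered.
        num_rows_upper_bound
    }
}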

Thinking about this some more: it gets really confusing with min/max/row_count/n_bytes, because an "overestimate" for "min" is a lower bound while for "max" it is an upper bound. #997 already suggests reworking this attribute to be field-specific. I would propose to extend the interface even further:

use datafusion_common::ScalarValue;

pub struct Boundary<T: PartialOrd> {
    pub val: T,
    pub is_lower_bound: bool,
    pub is_upper_bound: bool,
}

impl<T: PartialOrd> Boundary<T> {
    pub fn is_exact(&self) -> bool {
        self.is_lower_bound && self.is_upper_bound
    }
}

pub struct Statistics {
    /// The number of table rows
    pub num_rows: Option<Boundary<usize>>,
    /// Total bytes of the table rows
    pub total_byte_size: Option<Boundary<usize>>,
    /// Statistics on a column level
    pub column_statistics: Option<Vec<ColumnStatistics>>,
}

pub struct ColumnStatistics {
    /// Number of null values on column
    pub null_count: Option<Boundary<usize>>,
    /// Maximum value of column
    pub max_value: Option<Boundary<ScalarValue>>,
    /// Minimum value of column
    pub min_value: Option<Boundary<ScalarValue>>,
    /// Number of distinct values
    pub distinct_count: Option<Boundary<usize>>,
}

impl ColumnStatistics {
    pub fn min_max_exact(&self) -> bool {
        self.min_value.as_ref().map(|b| b.is_exact()).unwrap_or_default()
            && self.max_value.as_ref().map(|b| b.is_exact()).unwrap_or_default()
    }

    /// Does the range described by min-max contain ALL values?
    ///
    /// Note that the range might be too large. Some filters may not
    /// have been considered when this range was determined.
    pub fn min_max_contains_all(&self) -> bool {
        self.min_value.as_ref().map(|b| b.is_lower_bound).unwrap_or_default()
            && self.max_value.as_ref().map(|b| b.is_upper_bound).unwrap_or_default()
    }

    /// Does the range described by min-max contain actual data?
    ///
    /// Note that there might be values outside of this range, esp. when the
    /// statistics were constructed using sampling.
    pub fn min_max_guaranteed_to_contain_value(&self) -> bool {
        self.min_value.as_ref().map(|b| b.is_upper_bound).unwrap_or_default()
            && self.max_value.as_ref().map(|b| b.is_lower_bound).unwrap_or_default()
    }
}

Note that the exact interface and names are TBD, but it's a rough idea. Also there might be similar interfaces in the pruning predicates and analysis passes, so maybe the Boundary struct can be reused.
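To make the intended semantics concrete, here is a hypothetical usage sketch assuming the types proposed above plus DataFusion's existing ScalarValue: column statistics after a pushed-down filter, where the range is guaranteed to cover all values but may be wider than reality.

// Hypothetical example, assuming the Boundary / ColumnStatistics sketch above.
fn example_after_filter() -> ColumnStatistics {
    let stats = ColumnStatistics {
        null_count: None,
        max_value: Some(Boundary {
            val: ScalarValue::Int64(Some(100)),
            is_lower_bound: false, // the actual max may be smaller ...
            is_upper_bound: true,  // ... but is guaranteed not to be larger
        }),
        min_value: Some(Boundary {
            val: ScalarValue::Int64(Some(0)),
            is_lower_bound: true,  // the actual min is at least this value
            is_upper_bound: false,
        }),
        distinct_count: None,
    };

    assert!(!stats.min_max_exact());
    assert!(stats.min_max_contains_all());
    assert!(!stats.min_max_guaranteed_to_contain_value());
    stats
}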

alamb commented 1 year ago

so maybe the Boundary struct can be reused.

I agree this seems a better fit than trying to extend the Statistics interface. It turns out there are three different ways to do boundary analysis that I know of -- see https://github.com/apache/arrow-datafusion/issues/5535; perhaps one of them is sufficient for your use case (I am not 100% clear what that is). 🤔