apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.93k stars 1.12k forks source link

Implement statistics support for Substrait #8698

Open alamb opened 8 months ago

alamb commented 8 months ago

Is your feature request related to a problem or challenge?

A report from Twitter https://twitter.com/mim_djo/status/1740542585410814393

Says:

a new release of #datafusion 34, still reading #Deltatable via arrow is suboptimal compared to reading Parquet Directly :( something to do with passing stats to get correct join orders.

image

I think the issue is that https://github.com/apache/arrow-datafusion/issues/7949 and https://github.com/apache/arrow-datafusion/issues/7950 rely on statistics to pick non bad join orders for TPCH queries.

These statistics are not available from the delta provider it seems.

@andygrove says

RelCommon (common to all operators in Substrait) can contain a hint that has stats

 message Stats {
      double row_count = 1;
      double record_size = 2;
      substrait.extensions.AdvancedExtension advanced_extension = 10;
    }

Describe the solution you'd like

I would like the Datafusion substrait consumer/producer to handle translating

Describe alternatives you've considered

No response

Additional context

This was brought up by @Dandandan on the ASF slack: https://the-asf.slack.com/archives/C04RJ0C85UZ/p1703885214702039

xinlifoobar commented 2 months ago

Hi @alamb

I am looking into this issue and have a proposal for the following PR. The support for substrait statistics is quite basic compared to that of DataFusion. For example, Substrait only supports table-level statistics without precision and does not support column-level statistics. Here is the current Substrait Stats message:

message Stats {  
  double row_count = 1;  
  double record_size = 2;  
  substrait.extensions.AdvancedExtension advanced_extension = 10;  
}  

In contrast, DataFusion's statistics are more detailed:

#[derive(Debug, Clone, PartialEq, Eq)]  
pub struct Statistics {  
  /// The number of table rows.  
  pub num_rows: Precision,  
  /// Total bytes of the table rows.  
  pub total_byte_size: Precision,  
  /// Statistics on a column level. It contains a [`ColumnStatistics`] for  
  /// each field in the schema of the table to which the [`Statistics`] refer.  
  pub column_statistics: Vec<ColumnStatistics>,  
}  

To enhance the support of statistics in Substrait, I propose adding an AdvancedExtension to the Stats message. This extension is defined as follows:

// A generic object that can be used to embed additional extension information  
// into the serialized Substrait plan.  
message AdvancedExtension {  
  // An optimization is helpful information that doesn't influence semantics. May  
  // be ignored by a consumer.  
  google.protobuf.Any optimization = 1;  

  // An enhancement alters semantics. Cannot be ignored by a consumer.  
  google.protobuf.Any enhancement = 2;  
}  

I would add a new type in the datafusion-proto to define the new message with all the necessary fields. The new message is defined as:

message DatafusionStatsExtension {  
  // The version of the extension.  
  int32 version = 1;  

  // The statistics.  
  datafusion_common.Statistics statistics = 6;  
}  

On the producer side, it will try to encode the DatafusionStatsExtension message and attach it to the AdvancedExtension as an optimization.

On the consumer side, it will try to parse the AdvancedExtension message, extract the DatafusionStatsExtension message, and update the Stats message accordingly. If the DatafusionStatsExtension message is not present, it will treat the num_rows and total_byte_size as table-level statistics with Precision::EXACT.

What do you think about this idea? I'd like to hear your thoughts on this.

Thanks.

alamb commented 2 months ago

Hi @xinlifoobar -- I would personally suggest we don't try to encode statistics yet in Substrait because:

  1. We may want to change Statistics in DataFusion in the future
  2. I don't know of a usecase yet for passing along more detailed statistics in substrait. If the source and target system were DataFusion it is probably better to use protobuf. If one of the source / target are NOT DataFusion then the statistics information is not going to be useful

cc @Blizzara for your thoughts

alamb commented 2 months ago

So in other words, can we simply handle importing the basic statistics and not try to handle column level statistics?

xinlifoobar commented 2 months ago

So in other words, can we simply handle importing the basic statistics and not try to handle column level statistics?

Hi @alamb, as attached in #9347, the complete physical plans including the column level statistics, that's why I may want to encode it. There are other ways around it, though, do you suggest doing an additional read while the consumer consumes a ParquetExec plan?

alamb commented 2 months ago

🤔 If the goal is to get the round trip working, then I think you have no other choice likely than to try and encode the statistics

Another potential idea would be to update the code that checks the plans for being equal after conversion to strip out the statistics before comparisons (aka "normalize the plans")

alamb commented 2 months ago

I just worry about adding special case code that only works when datafusion is both the consumer and producer of the substrait plan, as I think that case is not super likely in practice