apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.47k stars 1.01k forks source link

Add example for writing an `AnalyzerRule` #10855

Open alamb opened 2 weeks ago

alamb commented 2 weeks ago

Is your feature request related to a problem or challenge?

We have an example for writing a user defined optimizer rule in

https://github.com/apache/datafusion/blob/3773fb7fb54419f889e7d18b73e9eb48069eb08e/datafusion/core/tests/user_defined/user_defined_plan.rs#L251

However, we don't have a corresponding example for writing a user defined AnalyzerRule which also means the APIs for using them are complicated (see https://github.com/apache/datafusion/pull/10849 for example)

Describe the solution you'd like

The idea example I think would be ti Add a file to https://github.com/apache/datafusion/tree/3773fb7fb54419f889e7d18b73e9eb48069eb08e/datafusion-examples

user_defined_analyzer.rs

Perhaps the example could show how to create an AnalyzerRule that replaced an experssion like a / b with a function call safe_div(a, b) where save_div

Describe alternatives you've considered

No response

Additional context

No response

alamb commented 2 weeks ago

Perhaps @goldmedal is interested in this / has some example code to share

goldmedal commented 2 weeks ago

I think we have an existing example showing how to create an AnalyzerRule in https://github.com/apache/datafusion/blob/3773fb7fb54419f889e7d18b73e9eb48069eb08e/datafusion-examples/examples/rewrite_expr.rs#L82

However, it only works with the analyzer and the optimizer. Maybe we can enhance it after #10849 to show how to apply the custom rules for the end-to-end DataFusion query flow.

goldmedal commented 2 weeks ago

Currently, in my personal work, I just use it like

let ctx = SessionContext::new();
let new_state = ctx
    .state()
    .add_analyzer_rule(Arc::new(ModelAnalyzeRule::new()))
    .add_analyzer_rule(Arc::new(ModelGenerationRule::new()));
let new_ctx = SessionContext::new_with_state(new_state);
// create a plan to run a SQL query
let df = new_ctx.sql("SELECT * FROM datafusion.default.orders").await?;

After the API updated, I guess it would be

let ctx = SessionContext::new();
ctx.add_analyzer_rule(Arc::new(ModelAnalyzeRule::new()))
    .add_analyzer_rule(Arc::new(ModelGenerationRule::new()));
// create a plan to run a SQL query
let df = ctx.sql("SELECT * FROM datafusion.default.orders").await?;

After #10849 is done, if no one else is working on it, I think I can help with it.

alamb commented 1 week ago

I have some time on a plane today that I may use to try and write up this example. I was inspired by some discussion I had this week

goldmedal commented 1 week ago

Hi @alamb, I just want to share how I use user-defined AnalyzerRule. As you know, I'm working on reimplementing the semantic layer engine, wren-engine, for LLM using DataFusion. The project is still a work in progress. However, I think I have finished the part of integration with DataFusion. I think it could be a nice use case for AnalyzerRule.

The basic concept of wren-engine is that user can define a virtual modeling layer to apply on their physical data. For example, given a csv file registered in DataFusion context called, datafusion.public.customers. We can wrap a virtual table for it called, wrenai.default.customers_model. User can query it like

select * from wrenai.default.customers_model

Then , wren engine translate it to a physical query:

SELECT 
  "customers_model"."city", 
  "customers_model"."id", 
  "customers_model"."state" 
FROM 
  (
    SELECT 
      "customers_model"."city", 
      "customers_model"."id", 
      "customers_model"."state" 
    FROM 
      (
        SELECT 
          "datafusion"."public"."customers"."city" AS "city", 
          "datafusion"."public"."customers"."id" AS "id", 
          "datafusion"."public"."customers"."state" AS "state" 
        FROM 
          "public"."customers"
      ) AS "customers_model"
  ) AS "customers_model"

It's a simple use case to show how it works. We have other features, such as relationship between models or calculated fields in the model. I don't go into detail about them here. However, I implemented all of them through AnalyzerRule and UserDefinedLogicalNode.

The related PR, https://github.com/Canner/wren-engine/pull/613, is still under review. However, you can find the usage of AnalyzerRule in rule.rs and the plan node in plan.rs. I also added some examples to show how to use the API or integrate with the DataFusion query flow.

Thanks to DataFusion for providing such amazing features, allowing me to implement a more stable and structured converter for the semantic layer.

alamb commented 1 week ago

Thank you @goldmedal - this is a great hint. I have some basic analyzer rule coded up but maybe your example is a better idea. I have a PR in progress that I need to polish up and then will ping you for review

alamb commented 5 days ago

Here is my proposed analyzer rule example: https://github.com/apache/datafusion/pull/11089

It isn't quite as fancy as the wren example, but I think it is an understandable enough example