apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.55k stars 1.03k forks source link

Implement DECIMAL type #122

Open alamb opened 3 years ago

alamb commented 3 years ago

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-10818

The TPC-H benchmarks correctly specify that all MONEY columns are DECIMAL type (precision and scale are not specified). We currently use DataType::Float64 which is much lighter than a true Decimal type.

To be a valid benchmark we need to ensure we support the same precision as the reference implementation.

j-a-m-l commented 2 years ago

@alamb any idea about how to start this?

xudong963 commented 2 years ago

Hi @j-a-m-l, I see arrow-rs has supported Decimal, maybe we can directly change here https://github.com/apache/arrow-datafusion/blob/22a7d74e9dece44adb790b5ea3fca27d26a2fe51/datafusion/src/sql/planner.rs#L312 cc @alamb

alamb commented 2 years ago

@j-a-m-l I think a @xudong963 has a good point that Arrow already includes some support for DecimalArrays.

A good place to start then be to write some tests in sql.rs that create a TableProvider with a column of DecimalArray type and try to run some basic sql

The typical order I would try is something like

// projection
SELECT col + 4 ... FROM table_with_decimal_col

// filter
SELECT col  FROM table_with_decimal_col where col < 5

// sort
SELECT col  FROM table_with_decimal_col order by col

// grouping / aggregatie
SELECT sum(col)  FROM table_with_decimal_col;
SELECT count(other_col)  FROM table_with_decimal_col group by col

Writing those tests is probably a good way to figure out how to plumb through the support in DataFusion -- I would also recommend doing it in that order (projection, filter, sort, grouping / aggregation)

j-a-m-l commented 2 years ago

Thanks @xudong963 and @alamb

liukun4515 commented 2 years ago

Is there any plan for implementation Decimal Type for datafusion? @alamb

xudong963 commented 2 years ago

Hi @j-a-m-l, do you have time to work on this?

liukun4515 commented 2 years ago

Is there any plan for implementation Decimal Type for datafusion? @alamb

I want to take some tasks to support the decimal data type in the datafusion.

alamb commented 2 years ago

Is there any plan for implementation Decimal Type for datafusion?

I don't know of any plans other than what is on this ticket. As I mentioned in https://github.com/apache/arrow-datafusion/issues/122#issuecomment-956212828 I think figuring out what subtasks are needed (by writing some tests and trying out what works / doesn't work) is a good first step

j-a-m-l commented 2 years ago

No, @xudong963. Currently I don't have enough time to do that, and I'll probably focus first on fixing other issues that I've found, such as https://github.com/apache/arrow-datafusion/issues/1173.

alamb commented 2 years ago

Here are several other issues that seem to be related to DECIMAL:

liukun4515 commented 2 years ago

Thanks for the feedback. I will begin to support DECIMAL datatype. Please assign this issue to me @alamb

liukun4515 commented 2 years ago

utils:

liukun4515 commented 2 years ago

In order to implement decimal in datafusion, we should ref some data type rules, like spark data type and type coercion. Now, the numerical data type just support some native type and the type coercion rules are simple. if we add the decimal data type, the type checking, type cast, type promotion and precision checking are all need to be considered.

alamb commented 2 years ago

https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/expressions/coercion.rs and https://github.com/apache/arrow-datafusion/blob/4687899957463ce81c4795a6d35d31320db0252b/datafusion/src/physical_plan/type_coercion.rs are the two places I know of in DataFusion today that do this type coercion

On Mon, Nov 22, 2021 at 9:33 PM Kun Liu @.***> wrote:

In order to implement decimal in datafusion, we should ref some data type rules, like spark data type https://spark.apache.org/docs/3.2.0/sql-ref-datatypes.html and type coercion https://spark.apache.org/docs/3.2.0/sql-ref-ansi-compliance.html#type-promotion-and-precedence . Now, the numerical data type just support some native type and the type coercion rules are simple. if we add the decimal data type, the type checking, type cast, type promotion and precision checking are all need to be considered.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/apache/arrow-datafusion/issues/122#issuecomment-976109767, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADXZMNH25DW2NQZKCQ75BDUNL4OBANCNFSM43S4FY3Q . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

liukun4515 commented 2 years ago

I know of in DataFusion today that do this type coercion

yep, Now I just try to support some simple demo code for decimal. For example function(min,max) and just Projection the column with decimal data type (without any operation and functions)

The rule of conversion in the above two files should be refactored, If we want to align the behavior with PG database exactly.

In the PG database, decimal and other numerical data type can be convert to each other, and in some case the varchar also can be convert to a decimal data type. In my opinion, we can support some simple and right rules first, and support other complex features later. For example, In the add mathematics operation for the decimal datatype, if left or right is the decimal datatype, we must require that two size are all decimal type. It is means that we don't support the cast or try cast to or from decimal data type implicitly.

What do you think about? @alamb

alamb commented 2 years ago

In my opinion, we can support some simple and right rules first, and support other complex features later

I think this is a great idea -- taking stock / consolidating the existing coercion logic might be a good place to start (before you try to extend it to support Decimal)