apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.86k stars 1.11k forks source link

[EPIC] Add Decimal support #3523

Open liukun4515 opened 1 year ago

liukun4515 commented 1 year ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do. A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] (This section helps Arrow developers understand the context and why for this feature, in addition to the what)

I have implemented the feature about decimal in the datafusion. But didn't take care about some special case. Specially when do the arithmetic operation:

  1. loss of the precision, In many case, the intermediate result will be overflow, So we use the float32 or float64 as the intermediate which will cause the the issue of loss of precision
  2. overflow/out of the range

Correctness Issues:

NULL cast/ NULL arithmetic/NULL comparator

String cast:

AGG decimal

User Experience:

Describe the solution you'd like A clear and concise description of what you want to happen.

Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.

Additional context Add any other context or screenshots about the feature request here.

liukun4515 commented 1 year ago

cc @liukun4515 @kmitchener

kmitchener commented 1 year ago

Thanks @liukun4515. It's all the issues listed in #3480 as well:

Per other discussions, I'd like to help here. I think the general plan should be something like:

I think there will be some churn .. for example, to fix #3521 , we could use the num crate that arrow-rs already uses and fix the implementation of the decimal divide method here. However, that's only a fine short-term solution for stabilization and we'd need to remove the num crate again when we finally move the Decimal kernel implementation over to arrow-rs. Is that acceptable?

Also, how do you deal with the synchronization between the projects? for example, some DF issues are due to the decimal cast code in arrow-rs .. how do we test the fixes? now that we're at the beginning of the dev cycle for 13, can we change DF to pin it to arrow-rs git revs?

cc @alamb

kmitchener commented 1 year ago

Hm, I see the situation in arrow-rs is complicated re decimals and I want to retract all opinions at this time. :)

liukun4515 commented 1 year ago

Also, how do you deal with the synchronization between the projects? for example, some DF issues are due to the decimal cast code in arrow-rs .. how do we test the fixes? now that we're at the beginning of the dev cycle for 13, can we change DF to pin it to arrow-rs git revs?

Hi @kmitchener

I think we can resolve the arithmetic operation for decimal128 data type in this sprint, I will do this and move the operation to the arrow-rs.

At the same time, I will fix the issue about decimal in datafusion. From the bug issue list, I think the most issue about decimal operation in the datafusion are about the divide arithmetic operation.

And other issue is about feature, like casting between string and decimal which can be implement in the arrow-rs.

liukun4515 commented 1 year ago

@kmitchener if the casting feature is important for you, it's great to add it in the arrow-rs kernel.

alamb commented 1 year ago

Also, how do you deal with the synchronization between the projects? for example, some DF issues are due to the decimal cast code in arrow-rs .. how do we test the fixes? now that we're at the beginning of the dev cycle for 13, can we change DF to pin it to arrow-rs git revs?

@kmitchener I have seen two basic patterns for this:

  1. Make a PR that pins to a arrow-rs git revision and complete the changes you want (and this basically leaves the PR as a draft for a while). Example: https://github.com/apache/arrow-datafusion/pull/3386
  2. Make a "wrapper" / "adapter" function for the arrow one, adding whatever new special logic is needed and calling into the arrow kernel if needed. For example https://github.com/apache/arrow-datafusion/blob/master/datafusion/physical-expr/src/expressions/binary/adapter.rs
kmitchener commented 1 year ago

@liukun4515 I'll submit a couple PRs to move the decimal kernels to arrow-rs. I have the code moved locally and it's mostly working -- enough that I'm confident I can complete it.

alamb commented 1 year ago

That would be awesome @kmitchener -- and a long time coming 👏

liukun4515 commented 1 year ago

I'll submit a couple PRs to move the decimal kernels to arrow-rs. I have the code moved locally and it's mostly working -- enough that I'm confident I can complete it.

I think it is not easy work to move the decimal operation to arrow-rs. We can not do the comparation and arithmetic operation based on the I128, we should support the decimal256 also. I try it in my local, it works. But We need do more work to make it grace, there are some pr about decimal in the arrow-rs you can take a look it.

cc @kmitchener

Can you submit your change in your branch? and give me link for your changes I want to take a look your thought.

kmitchener commented 1 year ago

@liukun4515 @alamb Here are the 2 PRs. The one for arrow-rs is ready for review. The one for DF will remain in draft until arrow-rs 24 is released.

https://github.com/apache/arrow-rs/pull/2770 https://github.com/apache/arrow-datafusion/pull/3590

If/once https://github.com/apache/arrow-rs/pull/2770 is merged in, I'll update the DF PR to use git revs.

alamb commented 4 months ago

I wonder if we should claim this ticket is complete? I wonder if there is anything else this ticket is tracking

liukun4515 commented 4 months ago

I wonder if we should claim this ticket is complete? I wonder if there is anything else this ticket is tracking

I will check above unclosed issue in this week.