apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.02k stars 1.14k forks source link

Improve type coercion and casting #8302

Open jayzhan211 opened 10 months ago

jayzhan211 commented 10 months ago

Is your feature request related to a problem or challenge?

I think there is room for improvement in type coerceion or casting.

Background

comparison_coercion is widely used in datafusion, a lossless conversion https://github.com/apache/arrow-datafusion/blob/main/datafusion/expr/src/type_coercion/binary.rs

can_coerce_from is used mainly for signature, a lossless conversion https://github.com/apache/arrow-datafusion/blob/main/datafusion/expr/src/type_coercion/functions.rs

can_cast_types is from arrow-cast, which is a lossy conversion. It is also used in some comparison_coercion building block. https://github.com/apache/arrow-rs/blob/df69ef57d055453c399fa925ad315d19211d7ab2/arrow-cast/src/cast.rs#L76-L273

Not sure if there is other coercion I missed

Proposal

comparison_coercion and can_coerce_from seem like doing the similar thing, maybe we can just have one lossless conversion. If lossless conversion is useful for arrow-rs, we can introduce a lossless version of can_cast_types, then rely on it for datafusion.

Lossy conversion vs Lossless

I think the definition for lossy is that the value is not recoverable after casting back, otherwise it is lossless.

Lossy

Describe the solution you'd like

  1. Replace can_coerce_from with comparison_coercion's building block numeric coercion, list coercion, string coercion, null coercion, etc
  2. Split list_coercion from string_coercion to make each building block of coercion clear on the task it focus on. list_coercion do list/fixed size list/large list coercion, string_coercion do utf/large utf coercion.
  3. Introduce these lossless coercion to arrow-rs?

Known issue or question I have

Describe alternatives you've considered

If there are many customize conversion need, then this change might not be helpful at all. We need other approach to let type casting / coercion easy to use.

No response

Additional context

No response

jayzhan211 commented 10 months ago

@tustvold Do you think introduce a lossless version of can_cast_types helpful? cc @alamb

tustvold commented 10 months ago

Decimal128 can cast to Float64 in can_coerce_from, why?

Floating points inherently permit precision loss. It is fairly standard that otherwise checked conversion permit precision loss when converting between floats.

For example - https://docs.rs/num-traits/latest/num_traits/cast/trait.FromPrimitive.html

A value can be represented by the target type when it lies within the range of scalars supported by the target type. For example, a negative integer cannot be represented by an unsigned integer type, and an i64 with a very high magnitude might not be convertible to an i32. On the other hand, conversions with possible precision loss or truncation are admitted, like an f32 with a decimal part to an integer type, or even a large f64 saturating to f32 infinity.

No list coercion for can_coerce_from

I think this would make sense as an addition, the current support for nested types in arrow-cast is very limited.

If lossless conversion is useful for arrow-rs

In general I think it would be very confusing if coercion can result in silent data loss, perhaps you might articulate why this would be useful? My current thinking is casting should never silently lose data.

jayzhan211 commented 10 months ago

My current thinking is casting should never silently lose data.

https://github.com/apache/arrow-rs/blob/df69ef57d055453c399fa925ad315d19211d7ab2/arrow-cast/src/cast.rs#L76C2-L80

Comment in can_cast_types mention that cast is lossy.

https://github.com/apache/arrow-rs/blob/df69ef57d055453c399fa925ad315d19211d7ab2/arrow-cast/src/cast.rs#L212-L216

Casting Int64 to Int32 return true. I'm not sure if the actual casting block this lossy conversion, but if not, could it silently lose data?

tustvold commented 10 months ago

That comment is incorrect and doesn't match the implementation. I believe it means that it supports conversions that are potentially fallible, e.g. Int32 to Int8, but it never silently looses data

jayzhan211 commented 10 months ago

That comment is incorrect and doesn't match the implementation. I believe it means that it supports conversions that are potentially fallible, e.g. Int32 to Int8, but it never silently looses data

Any scenario that we should hint it as potentially fallible but actually do nothing for them? It seems to me return false is more straightforward if we do nothing.

jayzhan211 commented 10 months ago

However, I think we can still cleanup comparison_coercion and can_coerce_from and forget about lossy conversion.

jayzhan211 commented 10 months ago

That comment is incorrect and doesn't match the implementation. I believe it means that it supports conversions that are potentially fallible, e.g. Int32 to Int8, but it never silently looses data

Any scenario that we should hint it as potentially fallible but actually do nothing for them? It seems to me return false is more straightforward if we do nothing.

Maybe Int64(1) cast to Int32(1) is one that is able to safely cast without data loss.

jayzhan211 commented 10 months ago

Maybe we need another verion of can_cast_types that no possible fallible hint. For i64, only u64 or larger type, no i32 or i16 since we don't know the actual data.

tustvold commented 10 months ago

Is that not what can_coerce_from is?

jayzhan211 commented 10 months ago

can_coerce_from

Yes! Should we have it aside can_cast_types in arrow-rs? Or just let it stay in datafusion for now.

tustvold commented 10 months ago

Given the type coercion logic itself lives in DataFusion I personally think it makes sense for it to remain alongside it

jayzhan211 commented 10 months ago

Short summary for the next step of this issue

  1. Replace can_coerce_from with component in comparison_coercion, so we dont need to think about which one to use. Reuse the code if possible, call can_cast_types if possible.
  2. Add nested support. Start from List

TODO

Issue

Note

can_coerce_from / comparison_coercion unlike arrow-cast should only check the casting to the larger type since the actual data is unknown.

jayzhan211 commented 10 months ago

https://github.com/apache/arrow-datafusion/blob/2a692446f46ef96f48eb9ba19231e9576be9ff5a/datafusion/expr/src/type_coercion/binary.rs#L388-L406

In comparison_binary_numeric_coercion, signed type and u64 is coerced to i64, anyone knows why is this acceptable? what kind of cases are?

for type u64 and f32, it will return f32 which is smaller than u64, not sure why is this too

alamb commented 10 months ago

https://github.com/apache/arrow-datafusion/blob/2a692446f46ef96f48eb9ba19231e9576be9ff5a/datafusion/expr/src/type_coercion/binary.rs#L388-L406

In comparison_binary_numeric_coercion, signed type and u64 is coerced to i64, anyone knows why is this acceptable? what kind of cases are?

for type u64 and f32, it will return f32 which is smaller than u64, not sure why is this too

The comment in the code you linked seems to say "some information loss is inevitable":

        // The following match arms encode the following logic: Given the two
        // integral types, we choose the narrowest possible integral type that
        // accommodates all values of both types. Note that some information
        // loss is inevitable when we have a signed type and a `UInt64`, in
        // which case we use `Int64`;i.e. the widest signed integral type.

It is likely for "usability" so that you can compare i64 and u64 without an explicit type cast 🀷

jayzhan211 commented 10 months ago

https://github.com/apache/arrow-datafusion/blob/2a692446f46ef96f48eb9ba19231e9576be9ff5a/datafusion/expr/src/type_coercion/binary.rs#L388-L406

In comparison_binary_numeric_coercion, signed type and u64 is coerced to i64, anyone knows why is this acceptable? what kind of cases are? for type u64 and f32, it will return f32 which is smaller than u64, not sure why is this too

The comment in the code you linked seems to say "some information loss is inevitable":

        // The following match arms encode the following logic: Given the two
        // integral types, we choose the narrowest possible integral type that
        // accommodates all values of both types. Note that some information
        // loss is inevitable when we have a signed type and a `UInt64`, in
        // which case we use `Int64`;i.e. the widest signed integral type.

It is likely for "usability" so that you can compare i64 and u64 without an explicit type cast 🀷

I don't think that usability is the good reason :( It might cause data loss.

@alamb I would like to change them to the straightforward version, choose the larger type between two. I think it can also be used all the place in datafusion easily (one place is can_coerce_from). Is there any concern about this change?

If such implicit casting is found helpful in the end, we can have two version, one that lossy is acceptable, one is not.

alamb commented 10 months ago

. Is there any concern about this change?

I would recommend you write some tests / try out the change before making a full featured PR. I am not sure what the implications are with the change you propose, but I suspect it could be non trivial and subtle

jayzhan211 commented 10 months ago

// The following match arms encode the following logic: Given the two // integral types, we choose the narrowest possible integral type that // accommodates all values of both types. Note that some information // loss is inevitable when we have a signed type and a UInt64, in // which case we use Int64;i.e. the widest signed integral type.

I rethink about this. This is the best we can do, like the comment mentioned, it is inevitable :(

jayzhan211 commented 10 months ago

After #8385, I found that it is not ideal to have one coercion for all the place (compare op, math op, signature coercion, etc... ).