Open jayzhan211 opened 10 months ago
@tustvold Do you think introduce a lossless version of can_cast_types
helpful?
cc @alamb
Decimal128 can cast to Float64 in can_coerce_from, why?
Floating points inherently permit precision loss. It is fairly standard that otherwise checked conversion permit precision loss when converting between floats.
For example - https://docs.rs/num-traits/latest/num_traits/cast/trait.FromPrimitive.html
A value can be represented by the target type when it lies within the range of scalars supported by the target type. For example, a negative integer cannot be represented by an unsigned integer type, and an i64 with a very high magnitude might not be convertible to an i32. On the other hand, conversions with possible precision loss or truncation are admitted, like an f32 with a decimal part to an integer type, or even a large f64 saturating to f32 infinity.
No list coercion for can_coerce_from
I think this would make sense as an addition, the current support for nested types in arrow-cast is very limited.
If lossless conversion is useful for arrow-rs
In general I think it would be very confusing if coercion can result in silent data loss, perhaps you might articulate why this would be useful? My current thinking is casting should never silently lose data.
My current thinking is casting should never silently lose data.
Comment in can_cast_types
mention that cast is lossy.
Casting Int64 to Int32 return true. I'm not sure if the actual
casting block this lossy conversion, but if not, could it silently lose data?
That comment is incorrect and doesn't match the implementation. I believe it means that it supports conversions that are potentially fallible, e.g. Int32 to Int8, but it never silently looses data
That comment is incorrect and doesn't match the implementation. I believe it means that it supports conversions that are potentially fallible, e.g. Int32 to Int8, but it never silently looses data
Any scenario that we should hint it as potentially fallible but actually do nothing for them? It seems to me return false is more straightforward if we do nothing.
However, I think we can still cleanup comparison_coercion
and can_coerce_from
and forget about lossy conversion.
That comment is incorrect and doesn't match the implementation. I believe it means that it supports conversions that are potentially fallible, e.g. Int32 to Int8, but it never silently looses data
Any scenario that we should hint it as potentially fallible but actually do nothing for them? It seems to me return false is more straightforward if we do nothing.
Maybe Int64(1) cast to Int32(1) is one that is able to safely cast without data loss.
Maybe we need another verion of can_cast_types
that no possible fallible hint. For i64, only u64 or larger type, no i32 or i16 since we don't know the actual data.
Is that not what can_coerce_from is?
can_coerce_from
Yes! Should we have it aside can_cast_types
in arrow-rs? Or just let it stay in datafusion for now.
Given the type coercion logic itself lives in DataFusion I personally think it makes sense for it to remain alongside it
Short summary for the next step of this issue
can_coerce_from
with component in comparison_coercion
, so we dont need to think about which one to use. Reuse the code if possible, call can_cast_types
if possible.comparison_coercion
, so we can differentiate unimplemented casting
or not possible casting
~ Return Option<T>
is probably enough.comparison_coercion
public, like https://github.com/apache/arrow-datafusion/pull/8308 can have numerice_coercion instead of comparison_coercion
~. No use case.comparison_coercion
, while Decimal is coerced to Float in can_coerced_from
. Confused π π π can_coerce_from
/ comparison_coercion
unlike arrow-cast
should only check the casting to the larger type since the actual data is unknown.
In comparison_binary_numeric_coercion
, signed type and u64 is coerced to i64, anyone knows why is this acceptable? what kind of cases are?
for type u64 and f32, it will return f32 which is smaller than u64, not sure why is this too
In
comparison_binary_numeric_coercion
, signed type and u64 is coerced to i64, anyone knows why is this acceptable? what kind of cases are?for type u64 and f32, it will return f32 which is smaller than u64, not sure why is this too
The comment in the code you linked seems to say "some information loss is inevitable":
// The following match arms encode the following logic: Given the two
// integral types, we choose the narrowest possible integral type that
// accommodates all values of both types. Note that some information
// loss is inevitable when we have a signed type and a `UInt64`, in
// which case we use `Int64`;i.e. the widest signed integral type.
It is likely for "usability" so that you can compare i64 and u64 without an explicit type cast π€·
In
comparison_binary_numeric_coercion
, signed type and u64 is coerced to i64, anyone knows why is this acceptable? what kind of cases are? for type u64 and f32, it will return f32 which is smaller than u64, not sure why is this tooThe comment in the code you linked seems to say "some information loss is inevitable":
// The following match arms encode the following logic: Given the two // integral types, we choose the narrowest possible integral type that // accommodates all values of both types. Note that some information // loss is inevitable when we have a signed type and a `UInt64`, in // which case we use `Int64`;i.e. the widest signed integral type.
It is likely for "usability" so that you can compare i64 and u64 without an explicit type cast π€·
I don't think that usability is the good reason :( It might cause data loss.
@alamb I would like to change them to the straightforward version, choose the larger type between two. I think it can also be used all the place in datafusion easily (one place is can_coerce_from
). Is there any concern about this change?
If such implicit casting is found helpful in the end, we can have two version, one that lossy is acceptable, one is not.
. Is there any concern about this change?
I would recommend you write some tests / try out the change before making a full featured PR. I am not sure what the implications are with the change you propose, but I suspect it could be non trivial and subtle
// The following match arms encode the following logic: Given the two // integral types, we choose the narrowest possible integral type that // accommodates all values of both types. Note that some information // loss is inevitable when we have a signed type and a
UInt64
, in // which case we useInt64
;i.e. the widest signed integral type.
I rethink about this. This is the best we can do, like the comment mentioned, it is inevitable :(
After #8385, I found that it is not ideal to have one coercion for all the place (compare op, math op, signature coercion, etc... ).
Is your feature request related to a problem or challenge?
I think there is room for improvement in
type coerceion
orcasting
.Background
comparison_coercion
is widely used in datafusion, a lossless conversion https://github.com/apache/arrow-datafusion/blob/main/datafusion/expr/src/type_coercion/binary.rscan_coerce_from
is used mainly for signature, a lossless conversion https://github.com/apache/arrow-datafusion/blob/main/datafusion/expr/src/type_coercion/functions.rscan_cast_types
is fromarrow-cast
, which is a lossy conversion. It is also used in somecomparison_coercion
building block. https://github.com/apache/arrow-rs/blob/df69ef57d055453c399fa925ad315d19211d7ab2/arrow-cast/src/cast.rs#L76-L273Not sure if there is other coercion I missed
Proposal
comparison_coercion
andcan_coerce_from
seem like doing the similar thing, maybe we can just have onelossless
conversion. If lossless conversion is useful for arrow-rs, we can introduce a lossless version ofcan_cast_types
, then rely on it for datafusion.Lossy conversion vs Lossless
I think the definition for lossy is that the value is not recoverable after casting back, otherwise it is lossless.
Lossy
Lossless
Describe the solution you'd like
can_coerce_from
withcomparison_coercion
's building blocknumeric coercion
,list coercion
,string coercion
,null coercion
, etclist_coercion
fromstring_coercion
to make each building block of coercion clear on the task it focus on. list_coercion dolist
/fixed size list
/large list
coercion,string_coercion
doutf
/large utf
coercion.Known issue or question I have
list_coercion
that currently exist instring_concat_coercion
can_coerce_from
can_coerce_from
, why?Describe alternatives you've considered
If there are many customize conversion need, then this change might not be helpful at all. We need other approach to let type casting / coercion easy to use.
No response
Additional context
No response