Open aykut-bozkurt opened 1 week ago
Alternative is to let the user decide on which approach to use. But, we need to pass cast_struct_by_name
as an option to can_cast_type
function, which is api change.
I'm not sure about this, casting by name comes with its own sets of issues, from performance to potential ambiguity.
I wonder if this is really the right level to be implementing this logic, or whether this really belongs in some sort of higher-level schema adapter.
Tagging @alamb as there is a fair amount of such schema munging logic in DF.
Regardless this is a breaking API change as written
I'm not sure about this, casting by name comes with its own sets of issues, from performance to potential ambiguity.
I agree it is non obvious and this the user should have to explicitly request this behavior
I wonder if this is really the right level to be implementing this logic, or whether this really belongs in some sort of higher-level schema adapter.
I personally think it makes sense in the cast kernel as it is a common operation (e.g. to convert structs with the same fields but in different order)
However that being said, a Struct
is basically like a Schema
(named fields with types) so I can see the argument that by supporting this type of cast in arrow-rs it will evolve over time into a schema adapter / matching code (e.g. filling out missing columns with NULLs, reordering columns, etc)
Tagging @alamb as there is a fair amount of such schema munging logic in DF.
Regardless this is a breaking API change as written
I agree. I also think it would be nice to avoid a breaking change if not necessary and I don't think it is necessary in this case.
Tto make it not a breaking change I think we could introduce can_cast_types_with_options
to mirror cast_with_options
and then control the field level rearranging based on the fields of CastOptions
so I can see the argument that by supporting this type of cast in arrow-rs it will evolve over time into a schema adapter / matching code
Right, and then you're doing potentially non-trivial work per batch, matching columns, etc... as opposed to per-schema. For example DF's SchemaAdapter computes the mapping once and can then apply that to multiple batches.
I guess what I am suggesting is that rather than baking this into the cast kernel, if we add a first-party schema adapter into arrow-rs. Potentially upstreaming the one already in DF with some modifications?
I guess what I am suggesting is that rather than baking this into the cast kernel, if we add a first-party schema adapter into arrow-rs. Potentially upstreaming the one already in DF with some modifications
For anyone interested, here is the API that is in DataFusion (it now even has ASCII art and Examples, thanks to @itsjunetime and myself):
We can/should probably change the names and reduce the levels of indirection of we upstreamed this into arrow-rs
We can/should probably change the names and reduce the levels of indirection of we upstreamed this into arrow-rs
100%, and I think the level of complexity that component has grown highlights my concern of trying to shoehorn all of this logic into the cast kernel directly.
We can/should probably change the names and reduce the levels of indirection of we upstreamed this into arrow-rs
100%, and I think the level of complexity that component has grown highlights my concern of trying to shoehorn all of this logic into the cast kernel directly.
I filed https://github.com/apache/arrow-rs/issues/6735 to track this idea
Sorry for all this back and forth @aykut-bozkurt -- what do you think about the discussion so far?
Sorry for all this back and forth @aykut-bozkurt -- what do you think about the discussion so far?
That makes sense to me. The proposal lets us do even more useful things (e.g. handle missing columns) with hopefully better performance.
Which issue does this PR close?
Part of #4908.
Rationale for this change
Supports more flexible struct to struct conversions.
What changes are included in this PR?
Struct to struct cast already works by casting each field by position (
to_fields.zip(from_fields)
). #5221This PR adds
cast by name
method and if cast by name is not possible, it fallbacks tocast by position
(current method).We can first try casting by matching field names if both structs have the same field names and the corresponding fields can be cast. If casting by name is not possible, it can fall back to casting by position.
Are there any user-facing changes?
No breaking api change. But user behavior may change. The struct, that is previously casted via position, could cast by name now.