apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.62k stars 802 forks source link

Try casting structs by name before by position #6726

Open aykut-bozkurt opened 1 week ago

aykut-bozkurt commented 1 week ago

Which issue does this PR close?

Part of #4908.

Rationale for this change

Supports more flexible struct to struct conversions.

What changes are included in this PR?

Struct to struct cast already works by casting each field by position (to_fields.zip(from_fields)). #5221

This PR adds cast by name method and if cast by name is not possible, it fallbacks to cast by position (current method).

We can first try casting by matching field names if both structs have the same field names and the corresponding fields can be cast. If casting by name is not possible, it can fall back to casting by position.

Are there any user-facing changes?

No breaking api change. But user behavior may change. The struct, that is previously casted via position, could cast by name now.

aykut-bozkurt commented 1 week ago

Alternative is to let the user decide on which approach to use. But, we need to pass cast_struct_by_name as an option to can_cast_type function, which is api change.

tustvold commented 1 week ago

I'm not sure about this, casting by name comes with its own sets of issues, from performance to potential ambiguity.

I wonder if this is really the right level to be implementing this logic, or whether this really belongs in some sort of higher-level schema adapter.

Tagging @alamb as there is a fair amount of such schema munging logic in DF.

Regardless this is a breaking API change as written

alamb commented 1 week ago

I'm not sure about this, casting by name comes with its own sets of issues, from performance to potential ambiguity.

I agree it is non obvious and this the user should have to explicitly request this behavior

I wonder if this is really the right level to be implementing this logic, or whether this really belongs in some sort of higher-level schema adapter.

I personally think it makes sense in the cast kernel as it is a common operation (e.g. to convert structs with the same fields but in different order)

However that being said, a Struct is basically like a Schema (named fields with types) so I can see the argument that by supporting this type of cast in arrow-rs it will evolve over time into a schema adapter / matching code (e.g. filling out missing columns with NULLs, reordering columns, etc)

Tagging @alamb as there is a fair amount of such schema munging logic in DF.

Regardless this is a breaking API change as written

I agree. I also think it would be nice to avoid a breaking change if not necessary and I don't think it is necessary in this case.

Tto make it not a breaking change I think we could introduce can_cast_types_with_options to mirror cast_with_options and then control the field level rearranging based on the fields of CastOptions

tustvold commented 1 week ago

so I can see the argument that by supporting this type of cast in arrow-rs it will evolve over time into a schema adapter / matching code

Right, and then you're doing potentially non-trivial work per batch, matching columns, etc... as opposed to per-schema. For example DF's SchemaAdapter computes the mapping once and can then apply that to multiple batches.

I guess what I am suggesting is that rather than baking this into the cast kernel, if we add a first-party schema adapter into arrow-rs. Potentially upstreaming the one already in DF with some modifications?

alamb commented 1 week ago

I guess what I am suggesting is that rather than baking this into the cast kernel, if we add a first-party schema adapter into arrow-rs. Potentially upstreaming the one already in DF with some modifications

For anyone interested, here is the API that is in DataFusion (it now even has ASCII art and Examples, thanks to @itsjunetime and myself):

We can/should probably change the names and reduce the levels of indirection of we upstreamed this into arrow-rs

tustvold commented 1 week ago

We can/should probably change the names and reduce the levels of indirection of we upstreamed this into arrow-rs

100%, and I think the level of complexity that component has grown highlights my concern of trying to shoehorn all of this logic into the cast kernel directly.

alamb commented 6 days ago

We can/should probably change the names and reduce the levels of indirection of we upstreamed this into arrow-rs

100%, and I think the level of complexity that component has grown highlights my concern of trying to shoehorn all of this logic into the cast kernel directly.

I filed https://github.com/apache/arrow-rs/issues/6735 to track this idea

alamb commented 6 days ago

Sorry for all this back and forth @aykut-bozkurt -- what do you think about the discussion so far?

aykut-bozkurt commented 6 days ago

Sorry for all this back and forth @aykut-bozkurt -- what do you think about the discussion so far?

That makes sense to me. The proposal lets us do even more useful things (e.g. handle missing columns) with hopefully better performance.