apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.5k stars 745 forks source link

Relax `dict_id` equality in field merging #6356

Open brancz opened 3 weeks ago

brancz commented 3 weeks ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Trying to merge two schemas that are compatible, other than having fields whose dict_ids are different.

Describe the solution you'd like

Relax dict_id handling during merging of fields, such that if they are different the dict_id is unset, and only if they are equal are they kept.

Describe alternatives you've considered

Before opening this, I tried implementing my own merge function, but would have to change it quite significantly compared to the existing code, since I can't access internal fields. I could still make this work, but wanted to first check whether the project would be open to relaxing this, since this feels more aligned with how merging works otherwise (eg. making a field nullable).

Additional context

All of this is under the assumption that I understand dict_id correctly, which is that it's primarily used for IPC's ability to duplicate dicts.

tustvold commented 3 weeks ago

https://github.com/apache/arrow-rs/issues/5981 is possibly related.

As it stands I am not sure we can ignore the dict ID when merging, as unlike with nulls, two arrays with different dictionary IDs can't be "merged", as there is no covering relationship between two different dict IDs.

brancz commented 3 weeks ago

Isn't that just an optimization though? I understand that if two dict IDs are identical one can just append the index arrays to each other safely, and if not then the dicts needs to be unified and indexes added transposed (using the terminology that the Go arrow package uses not sure it's universal it's just the library I have the most experience with).

tustvold commented 3 weeks ago

Unfortunately that isn't how the IPC implementation treats them, in particular IPC files use dictionary IDs as unique identifiers to immutable dictionaries. It is all a bit confused, and it is peculiar that they're part of the schema definition at all, but that is what #5981 concerns.

brancz commented 3 weeks ago

If I understand this right then, it should be possible to remove dict_id from Field if we solve IPC dictionary ID handling some other way. And consequently dict_id would no longer be part of equality of Field. Do I understand correctly?