apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.59k stars 791 forks source link

IPC not respecting not preserving dict ID #6443

Closed brancz closed 1 month ago

brancz commented 1 month ago

Describe the bug

When setting with_preserve_dict_id(false) on IpcWriteOptions of a StreamWriter, and then write a record with multiple dicts whose Fields in the Schema have dict_id: 0, then the last dict's dictionary is actually used for all occurrences.

Best case this causes data to be incorrect, worst case, it causes a panic (which is what led me down this path because my first dictionary had more entries than the second and it caused an out of bounds panic).

To Reproduce

https://gist.github.com/brancz/067bfe6c9f9dfa7a7db82da1757e0edc results in

assertion `left == right` failed
  left: RecordBatch { schema: Schema { fields: [Field { name: "a", data_type: Dictionary(Int32, Utf8), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "b"
, data_type: Dictionary(Int32, Utf8), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [DictionaryArray {keys: PrimitiveArray<Int32>
[
  0,
  1,
] values: StringArray
[
  "e",
  "f",
]}
, DictionaryArray {keys: PrimitiveArray<Int32>
[
  0,
  1,
] values: StringArray
[
  "e",
  "f",
]}
], row_count: 2 }
 right: RecordBatch { schema: Schema { fields: [Field { name: "a", data_type: Dictionary(Int32, Utf8), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "b"
, data_type: Dictionary(Int32, Utf8), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [DictionaryArray {keys: PrimitiveArray<Int32>
[
  0,
  1,
] values: StringArray
[
  "c",
  "d",
]}
, DictionaryArray {keys: PrimitiveArray<Int32>
[
  0,
  1,
] values: StringArray
[
  "e",
  "f",
]}
], row_count: 2 }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Expected behavior

Dicts are assigned correctly when the Schema's Field's dict_id is requested to not be preserved.

@alamb @tustvold

tustvold commented 1 month ago

Tagging @thinkharderdev who implemented this functionality in https://github.com/apache/arrow-rs/pull/5971

brancz commented 1 month ago

This confused me at first as well, but #5971 actually only solved it for IPC via flight, because it pre-processes the schema. This is about IPC without flight.

I opened #6444 which intentionally doesn't change the behavior introduced in #5971, but I think long term it would be better to consolidate at the lower level that I implemented in #6444, because that will mean that we'll actually be able to remove the dict_id field from the Schema, as with the approach in #6444 the assigning the dict ID is done entirely separately from the Schema, and only through the dictionary tracker.

alamb commented 1 month ago

label_issue.py automatically added labels {'parquet'} from #6444

alamb commented 1 month ago

label_issue.py automatically added labels {'arrow'} from #6444

alamb commented 1 month ago

label_issue.py automatically added labels {'arrow-flight'} from #6444

alamb commented 1 month ago

label_issue.py automatically added labels {'next-major-release'} from #6444