Closed brancz closed 1 month ago
Tagging @thinkharderdev who implemented this functionality in https://github.com/apache/arrow-rs/pull/5971
This confused me at first as well, but #5971 actually only solved it for IPC via flight, because it pre-processes the schema. This is about IPC without flight.
I opened #6444, which intentionally doesn't change the behavior introduced in #5971. Long term, though, I think it would be better to consolidate at the lower level that I implemented in #6444, because that would let us actually remove the `dict_id` field from the `Schema`: with the approach in #6444, assigning the dict ID is done entirely separately from the `Schema`, and only through the dictionary tracker.
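To illustrate the idea of assigning IDs through a tracker rather than reading them from the schema, here is a minimal standalone sketch. It is not the arrow-rs implementation and not the code in #6444; `SketchTracker` and `id_for_field` are hypothetical names, and it assumes IDs are handed out per field position in order of first encounter.

```rust
use std::collections::HashMap;

/// Hypothetical tracker (not the arrow-rs type): it hands out dictionary IDs
/// itself and never looks at any `dict_id` stored in the schema.
struct SketchTracker {
    ids_by_field: HashMap<usize, i64>,
    next_id: i64,
}

impl SketchTracker {
    fn new() -> Self {
        SketchTracker { ids_by_field: HashMap::new(), next_id: 0 }
    }

    /// Returns a stable ID for the field at `field_index`, allocating a
    /// fresh one the first time that field is seen.
    fn id_for_field(&mut self, field_index: usize) -> i64 {
        if let Some(&id) = self.ids_by_field.get(&field_index) {
            return id;
        }
        let id = self.next_id;
        self.next_id += 1;
        self.ids_by_field.insert(field_index, id);
        id
    }
}

fn main() {
    // Three dictionary fields that all carry `dict_id: 0` in the schema.
    let schema_dict_ids = [0_i64, 0, 0];
    let mut tracker = SketchTracker::new();
    let assigned: Vec<i64> = (0..schema_dict_ids.len())
        .map(|i| tracker.id_for_field(i))
        .collect();
    // Each field gets its own ID regardless of what the schema said.
    println!("{:?}", assigned);
}
```

Because the schema values are never consulted in this sketch, the field on the schema becomes redundant, which is the property the comment above argues for.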
Describe the bug
When setting `with_preserve_dict_id(false)` on the `IpcWriteOptions` of a `StreamWriter` and then writing a record with multiple dicts whose `Field`s in the `Schema` have `dict_id: 0`, the last dict's dictionary is actually used for all occurrences. In the best case this causes data to be incorrect; in the worst case it causes a panic (which is what led me down this path, because my first dictionary had more entries than the second and it caused an out-of-bounds panic).
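The failure mode can be modeled without arrow-rs at all. The following is a simplified sketch, not the library's code: `DictColumn`, `write_dictionaries`, and `decode` are hypothetical names, and it assumes dictionaries are stored keyed only by their ID, so two columns sharing `dict_id: 0` collide.

```rust
use std::collections::HashMap;

/// Hypothetical dictionary-encoded column: keys index into `values`.
struct DictColumn {
    dict_id: i64,
    values: Vec<&'static str>,
    keys: Vec<usize>,
}

/// Models a writer that stores dictionaries keyed only by `dict_id`.
/// A later column with the same ID silently replaces the earlier one.
fn write_dictionaries(columns: &[DictColumn]) -> HashMap<i64, Vec<&'static str>> {
    let mut by_id = HashMap::new();
    for col in columns {
        by_id.insert(col.dict_id, col.values.clone());
    }
    by_id
}

/// Models a reader resolving keys against the stored dictionary.
/// Returns `None` where a real reader would hit an out-of-bounds index.
fn decode(
    col: &DictColumn,
    by_id: &HashMap<i64, Vec<&'static str>>,
) -> Option<Vec<&'static str>> {
    let dict = by_id.get(&col.dict_id)?;
    col.keys.iter().map(|&k| dict.get(k).copied()).collect()
}

fn main() {
    // The first dictionary has more entries than the second, as in the report.
    let columns = [
        DictColumn { dict_id: 0, values: vec!["a", "b", "c"], keys: vec![0, 2] },
        DictColumn { dict_id: 0, values: vec!["x"], keys: vec![0] },
    ];
    let by_id = write_dictionaries(&columns);
    println!("{:?}", decode(&columns[0], &by_id)); // None: key 2 is out of range for ["x"]
    println!("{:?}", decode(&columns[1], &by_id)); // Some(["x"])
}
```

If the first column's keys had all been in range for the second dictionary, decoding would have succeeded with the wrong values, which corresponds to the "data is incorrect" case.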
To Reproduce
https://gist.github.com/brancz/067bfe6c9f9dfa7a7db82da1757e0edc results in
Expected behavior
Dicts are assigned correctly when the `Schema`'s `Field`'s `dict_id` is requested to not be preserved.

@alamb @tustvold