Open tustvold opened 2 years ago
Another alternative is to remove dict_id
entirely from Field
and only use it while doing IPC serialization
This appears to be what @jorgecarleitao has done in arrow2
https://github.com/jorgecarleitao/arrow2/search?q=dict_id https://github.com/jorgecarleitao/arrow2/blob/main/src/datatypes/field.rs#L12-L21
It seems like a serde concern to me, and I don't see any value of having it in the schema. I'd be in favor of removing it from Field
entirely and assigning it only during serde.
Another alternative is to remove
dict_id
entirely fromField
and only use it while doing IPC serializationThis appears to be what @jorgecarleitao has done in arrow2
https://github.com/jorgecarleitao/arrow2/search?q=dict_id https://github.com/jorgecarleitao/arrow2/blob/main/src/datatypes/field.rs#L12-L21
We're currently having a lot of issues with dict_id
in arrow flight. There are a couple of different issues which I think @alamb suggestion above would alleviate:
dict_id
all the time ends up creating a lot of boilerplate since you need some way to track the current dict_id
s and ensure that you always preserve the dict_id
once it is assigned.FlightDataEncoder
is inconsistent in where it gets dictionary ids from. When it sends the schema
it will of course send the dict_id
defined in the schema field. But when serializing the array it will send the dict_id
from the array's datatype. Theoretically these should always be the same but since PartialEq
on Field
ignores the dict_id
this is hard to enforce in practice. dict_id
when building an array with nested dictionary fields. Until very recently, using GenericListBuilder
would just default the dict_id
on it's item
field. That is all to say I'd like to work on making dict_id
an internal implementation detail of the serialization and completely ignore the field's dict_id
for that purpose. Then we can deprecate Field::dict_id
and then eventually remove the dict_id
field from Field
entirely.
+1 it has been a major foot gun for us.
That is all to say I'd like to work on making
dict_id
an internal implementation detail of the serialization and completely ignore the field'sdict_id
for that purpose. Then we can deprecateField::dict_id
and then eventually remove thedict_id
field fromField
entirely.
That seems reasonable to me from what I can tell.
cc @tustvold
Also I wonder if @jhorstmann has any thoughts in this matter. I believe he was / has used dict_id
in the past so maybe has some more perspective to share
Dict id being present as a first-party field inside the schema has always felt a bit odd to me, I fully support making it a serde detail.
If users want a potentially broader notion of dictionary IDs at the schema level, there is nothing to prevent them using field metadata to do this
AFAIK we are not using the dict_id
property. We make heavy use of dictionary arrays, and one of the invariants in our system is that columns use the same dictionary across batches, but this is not reflected in the schema or fields. In fact, when we tried to use the ipc format to persist data we were post-processing the schema and assigning dictionary ids based on the position of the field in the schema. That post-processing would not have been necessary if the dict id was only part of the serialization api.
(A bit later we switched to just hydrating the dictionary arrays, because the dictionary were often larger than the actual filtered or aggregated data.)
Which part is this question about
The
Field
data structure contains adict_id
member, that stores an i64. It appears the intention of this is that different dictionaries will have different IDs, however, this currently appears to only be respected by the IPC format and isn't widely utilised by arrow-rs.Describe your question
Most of arrow-rs is completely agnostic to dict_id, with compute kernels completely ignoring it, even those that recompute dictionaries such as concat.
The only parts of the stack that appear to use the dict_ids are the IPC interfaces, which will error if they encounter the same dict_id multiple times. I think this is inconsistency is a tad confusing, I think we should do one of the following:
Of these the first would definitely be simpler to implement, but I'm not familiar enough with the purpose of dict_id to be certain there isn't some use-case this would preclude?
Additional context
As
Field
is part of theSchema
, RecordBatch with different dict_id will appear to have different schema. This may have downstream implications for things like DataFusion which have strong assumptions on schema consistency within a plan.This cropped up in https://github.com/apache/arrow-datafusion/pull/1596 as it is using the arrow IPC format to spill buffers to disk.