Open asfimport opened 1 year ago
David Li / @lidavidm:
I might be missing something, but would List<Dictionary<Int32, Utf8>> work?
Sven Cattell: @lidavidm I'm not sure how to create that with the Rust API. Is it possible to build that nicely in other APIs?
David Li / @lidavidm: I'm not familiar with the Rust APIs, but in Python/C++ it's pretty straightforward:
>>> import pyarrow as pa
>>> ty = pa.list_(pa.dictionary(pa.int16(), pa.string()))
>>> ty
ListType(list<item: dictionary<values=string, indices=int16, ordered=0>>)
>>> pa.array([["tag1", "tag2"], ["tag1", "tag3"]], ty)
<pyarrow.lib.ListArray object at 0x7fc4d89ca940>
[
  -- dictionary:
    [
      "tag1",
      "tag2",
      "tag3"
    ]
  -- indices:
    [
      0,
      1
    ],
  -- dictionary:
    [
      "tag1",
      "tag2",
      "tag3"
    ]
  -- indices:
    [
      0,
      2
    ]
]
I want to efficiently encode lists of tags for each element in my database. In my case I have 30 tags, and a few are assigned to each of my ~20m records. Here's a simplified example of 5 records:
pe
Right now I have to store these in a List, which means huge amounts of duplicated data. The dictionary array looks almost perfect for this task. I just want the dictionary's index type to allow a List<T> instead of only a primitive T.
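To make the duplication concern concrete, here is a minimal pure-Python sketch of the idea behind a list of dictionary-encoded values: each distinct tag string is stored once, every tag occurrence becomes a small integer index, and an offsets array marks where each record's tags start and end. The names `encode_tag_lists` and `decode_record` are hypothetical helpers written for this illustration. They are not part of any Arrow API, and the code only mimics the layout conceptually.

```python
from array import array


def encode_tag_lists(records):
    """Encode lists of string tags as (dictionary, indices, offsets).

    Illustrative only: a hypothetical helper, not an Arrow API.
    """
    dictionary = []            # unique tag strings, in first-seen order
    positions = {}             # tag -> its position in `dictionary`
    indices = array("b")       # one signed byte per tag occurrence
    offsets = array("i", [0])  # record i spans indices[offsets[i]:offsets[i + 1]]
    for tags in records:
        for tag in tags:
            if tag not in positions:
                positions[tag] = len(dictionary)
                dictionary.append(tag)
            indices.append(positions[tag])
        offsets.append(len(indices))
    return dictionary, indices, offsets


def decode_record(dictionary, indices, offsets, i):
    """Rebuild the list of tag strings for record i."""
    return [dictionary[j] for j in indices[offsets[i]:offsets[i + 1]]]


records = [["tag1", "tag2"], ["tag1", "tag3"]]
dictionary, indices, offsets = encode_tag_lists(records)
print(dictionary)                                        # unique tags
print(list(indices), list(offsets))                      # flat indices and record boundaries
print(decode_record(dictionary, indices, offsets, 1))    # tags of the second record
```

With about 30 distinct tags, a one-byte index per occurrence is enough, so the tag strings themselves are stored only once no matter how many records reference them.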
Reporter: Sven Cattell
Note: This issue was originally created as ARROW-18090. Please see the migration documentation for further details.