apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

Dictionary Style array for Keywords or Tags #33288

Open asfimport opened 1 year ago

asfimport commented 1 year ago

I want to efficiently encode a list of tags for each record in my database. In my case there are 30 distinct tags, and a few are assigned to each of my ~20 million records. Here's a simplified example of 5 records:

Reporter: Sven Cattell

Note: This issue was originally created as ARROW-18090. Please see the migration documentation for further details.

asfimport commented 1 year ago

David Li / @lidavidm: I might be missing something, but would List<Dictionary<Int32, Utf8>> work?
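
To make the suggested List<Dictionary<Int32, Utf8>> layout concrete, here is a minimal plain-Python sketch of the three buffers such a column conceptually holds: a dictionary of distinct tags, one integer index per tag occurrence, and list offsets marking where each record starts and ends. The helper names `encode_tags` and `decode_tags` are hypothetical and are not part of any Arrow API; this only illustrates the idea, not how Arrow implements it.

```python
# Conceptual sketch only: plain Python lists stand in for Arrow buffers.
# `encode_tags` and `decode_tags` are hypothetical helper names.

def encode_tags(records):
    """Encode lists of tag strings as (dictionary, indices, offsets)."""
    dictionary = []   # each distinct tag string, stored once
    positions = {}    # tag -> its index in `dictionary`
    indices = []      # one small integer per tag occurrence
    offsets = [0]     # record i owns indices[offsets[i]:offsets[i + 1]]
    for tags in records:
        for tag in tags:
            if tag not in positions:
                positions[tag] = len(dictionary)
                dictionary.append(tag)
            indices.append(positions[tag])
        offsets.append(len(indices))
    return dictionary, indices, offsets


def decode_tags(dictionary, indices, offsets):
    """Rebuild the per-record tag lists from the three buffers."""
    return [
        [dictionary[i] for i in indices[start:end]]
        for start, end in zip(offsets, offsets[1:])
    ]


records = [["tag1", "tag2"], ["tag1", "tag3"], []]
dictionary, indices, offsets = encode_tags(records)
```

With only 30 distinct tags, each occurrence costs one small integer rather than a full string, which is where the space saving for ~20 million records would come from.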

asfimport commented 1 year ago

Sven Cattell: @lidavidm I'm not sure how to create that with the Rust API. Is it possible to build that nicely in other APIs?

asfimport commented 1 year ago

David Li / @lidavidm: I'm not familiar with the Rust APIs, but in Python/C++ it's pretty straightforward:


>>> import pyarrow as pa
>>> ty = pa.list_(pa.dictionary(pa.int16(), pa.string()))
>>> ty
ListType(list<item: dictionary<values=string, indices=int16, ordered=0>>)
>>> pa.array([["tag1", "tag2"], ["tag1", "tag3"]], ty)
<pyarrow.lib.ListArray object at 0x7fc4d89ca940>
[

  -- dictionary:
    [
      "tag1",
      "tag2",
      "tag3"
    ]
  -- indices:
    [
      0,
      1
    ],

  -- dictionary:
    [
      "tag1",
      "tag2",
      "tag3"
    ]
  -- indices:
    [
      0,
      2
    ]
]