
[Python] Pyarrow DictionaryArray.dictionary_decode mangling strings #34583

Open MurrayData opened 1 year ago

MurrayData commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

I have a NumPy array of strings containing UK postcodes (2,603,678 entries) and another NumPy array of indices into it (1,785,499,246 entries).

In the output file I want to replace the indices with the strings, so I created a DictionaryArray from them as follows:

postcode_dict = pa.DictionaryArray.from_arrays(pcds_id, pcds)

Where pcds_id contains the indices and pcds contains the postcode strings.
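For reference, a minimal self-contained sketch of this construction (with synthetic, illustrative data rather than the real arrays):

import numpy as np
import pyarrow as pa

# Illustrative stand-ins for the real arrays described above
pcds = np.array(["AB1 0AA", "AB1 0AB", "ZE3 9XP"])       # unique postcode strings
pcds_id = np.array([0, 0, 1, 2, 2, 2], dtype=np.int32)   # indices into pcds

postcode_dict = pa.DictionaryArray.from_arrays(pcds_id, pcds)
print(postcode_dict)   # a DictionaryArray: the strings as dictionary, ints as indices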

A UK postcode has the format A[A]N[N|A] NAA, so it varies between 6 and 8 characters in length.

A = alpha, N = numeric, | = or, [] = optional.
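Expressed as a regular expression (an illustrative check of the pattern above, not the full official postcode grammar):

import re

# A[A]N[N|A] NAA: 1-2 letters, a digit, an optional digit or letter,
# then a space, a digit, and two letters (6-8 characters in total)
POSTCODE_RE = re.compile(r"[A-Z]{1,2}[0-9][0-9A-Z]? [0-9][A-Z]{2}")

assert POSTCODE_RE.fullmatch("AB1 0AA")
assert POSTCODE_RE.fullmatch("ZE3 9XP")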

The dictionary creation works fine:

<pyarrow.lib.DictionaryArray object at 0x7fc6b758b8b0>

-- dictionary:
  [
    "AB1 0AA",
    "AB1 0AB",
    "AB1 0AD",
    "AB1 0AE",
    "AB1 0AF",
    "AB1 0AG",
    "AB1 0AJ",
    "AB1 0AL",
    "AB1 0AN",
    "AB1 0AP",
    ...
    "ZE3 9JP",
    "ZE3 9JR",
    "ZE3 9JS",
    "ZE3 9JT",
    "ZE3 9JU",
    "ZE3 9JW",
    "ZE3 9JX",
    "ZE3 9JY",
    "ZE3 9JZ",
    "ZE3 9XP"
  ]
-- indices:
  [
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    ...
    2603676,
    2603676,
    2603677,
    2603677,
    2603677,
    2603677,
    2603677,
    2603677,
    2603677,
    2603677
  ]

However, applying dictionary_decode to this array mangles the strings:

<pyarrow.lib.StringArray object at 0x7ed1b2e2f760>
[
  "AB1 0AA",
  "AB1 0AA",
  "AB1 0AA",
  "AB1 0AA",
  "AB1 0AA",
  "AB1 0AA",
  "AB1 0AA",
  "AB1 0AA",
  "AB1 0AA",
  "AB1 0AA",
  ...
  " 1UGAL1",
  " 1UGAL1",
  " 1UGAL1",
  " 1UGAL1",
  " 1UGAL1",
  " 1UGAL1",
  " 1UGAL1",
  " 1UGAL1",
  " 1UGAL1",
  " 1UGAL1"
]

However, if I convert the array to pandas, it decodes correctly:

postcode_dict.to_pandas()

0             AB1 0AA
1             AB1 0AA
2             AB1 0AA
3             AB1 0AA
4             AB1 0AA
               ...   
1785499241    ZE3 9XP
1785499242    ZE3 9XP
1785499243    ZE3 9XP
1785499244    ZE3 9XP
1785499245    ZE3 9XP
Length: 1785499246, dtype: category
Categories (2603678, object): ['AB1 0AA', 'AB1 0AB', 'AB1 0AD', 'AB1 0AE', ..., 'ZE3 9JX', 'ZE3 9JY', 'ZE3 9JZ', 'ZE3 9XP']

Source arrays in NPZ format inside a zip file:

postcode_dict.zip

DictionaryArray in Parquet inside a zip file on Google Drive (too large for GitHub):

postcode_dict_parquet.zip

I'm using a RAPIDS Docker image from NVIDIA NGC (the workflow has GPU dependencies, e.g. cuGraph), and the PyArrow version is 10.0.1.

Component(s)

Python

Data Source Attribution & License

The data is a spatial weights matrix (a crosstab of graph distances) built from the UK Office for National Statistics (ONS) Postcode Directory, February 2023.

jorisvandenbossche commented 1 year ago

Thanks for the reproducer! I can't test it directly right now (too big for my laptop's memory), but my first guess about what is going on is that the DictionaryArray itself has a dictionary of string type, while the decoded array should use the large_string type, because the full decoded data is too big to fit into a normal string array (which uses int32 offsets). And from the output you show, it does seem to be producing a normal StringArray.
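To put rough numbers on that guess (an illustrative back-of-the-envelope check, assuming ~7 bytes per postcode):

n_values = 1_785_499_246            # decoded array length from the report
avg_len = 7                         # UK postcodes are 6-8 chars; assume ~7
total_bytes = n_values * avg_len    # ~12.5 GB of character data
int32_limit = 2**31 - 1             # ~2.1 GB addressable with int32 offsets
print(total_bytes > int32_limit)    # True: the offsets overflow, which would explain the mangling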

If that guess is correct, a possible workaround for now could be to first chunk the dictionary array postcode_dict before decoding it:

postcode_dict_chunked = pa.chunked_array([postcode_dict.slice(i, 100_000) for i in range(0, len(postcode_dict), 100_000)])

Now, ChunkedArray doesn't have a dictionary_decode method, so we do this manually with a take:

postcode_indices_chunked = pa.chunked_array([chunk.indices for chunk in postcode_dict_chunked.chunks])
postcode_decoded = postcode_dict.dictionary.take(postcode_indices_chunked)
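The result postcode_decoded is then a ChunkedArray of plain string chunks; a quick sanity check (illustrative):

# Each 100,000-row chunk holds far less than 2**31 bytes of characters,
# so the per-chunk int32 offsets are safe.
print(postcode_decoded.type)         # string
print(postcode_decoded.num_chunks)   # ~17,855 chunks of 100,000 rows each
print(postcode_decoded.slice(0, 3))  # first few decoded postcodes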
MurrayData commented 1 year ago

Thanks for the reproducer! I can't test it directly right now (too big for my laptop's memory), but my first guess about what is going on is that the DictionaryArray itself has a dictionary of string type, while the decoded array should use the large_string type ...

Thank you, I'll give that a try. Initially I used a DuckDB join as a workaround, which worked fine, but now that I've found it decodes correctly with to_pandas, I'll use that as a convenient workaround. The reason it's a NumPy array is that the prior step is a spatial join, so I used GeoPandas for that part of the processing.
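For anyone curious why the to_pandas route avoids the problem: my understanding (building on the diagnosis above) is that it produces a pandas Categorical, which stores only integer codes plus the 2,603,678 unique strings and never materialises a single int32-offset string buffer:

s = postcode_dict.to_pandas()     # the category-dtype Series shown earlier
print(s.dtype)                    # category
print(s.cat.codes.dtype)          # compact integer codes
print(len(s.cat.categories))      # 2603678 unique postcodes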

jorisvandenbossche commented 1 year ago

Since the dictionary_decode method is actually just implemented as a take with the indices (as I showed above), it's that take operation that is doing the wrong thing. We have various other issues about cases where we should probably upcast to large_string automatically, e.g. when concatenating arrays: https://github.com/apache/arrow/issues/33049, https://github.com/apache/arrow/issues/23539, https://github.com/apache/arrow/issues/25822, and others.

The easier workaround is probably to cast to large_string first (then you don't need the manual chunking):

postcode_dict_large = postcode_dict.cast(pa.dictionary(pa.int32(), pa.large_string()))
postcode_dict_large.dictionary_decode()
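A quick check that the decoded result really has 64-bit offsets (illustrative):

decoded = postcode_dict_large.dictionary_decode()
print(decoded.type)   # large_string: int64 offsets, so no 2 GiB limit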
MurrayData commented 1 year ago

The easier workaround is probably to cast to large_string first (then you don't need the manual chunking):

postcode_dict_large = postcode_dict.cast(pa.dictionary(pa.int32(), pa.large_string()))
postcode_dict_large.dictionary_decode()

Thank you @jorisvandenbossche. Simply specifying type=pa.large_string() for the dictionary values passed to DictionaryArray.from_arrays solved the problem. Noted for future applications. pcds_id is already np.int32.

postcode_dict = pa.DictionaryArray.from_arrays(pa.array(pcds_id), pa.array(pcds, type=pa.large_string()))

worked fine, and

postcode_dict.dictionary_decode()

now produces the correct result:

<pyarrow.lib.LargeStringArray object at 0x7f3ca9aea0e0>
[
  "AB1 0AA",
  "AB1 0AA",
  "AB1 0AA",
  "AB1 0AA",
  "AB1 0AA",
  "AB1 0AA",
  "AB1 0AA",
  "AB1 0AA",
  "AB1 0AA",
  "AB1 0AA",
  ...
  "ZE3 9JZ",
  "ZE3 9JZ",
  "ZE3 9XP",
  "ZE3 9XP",
  "ZE3 9XP",
  "ZE3 9XP",
  "ZE3 9XP",
  "ZE3 9XP",
  "ZE3 9XP",
  "ZE3 9XP"
]