Open MurrayData opened 1 year ago
Thanks for the reproducer! I can't directly test now (too big for my laptop's memory), but my first guess about what is going on is that the DictionaryArray itself has a dictionary of string type, while the decoded array should use a large_string type, as the full decoded data is too big to fit into a normal string array (which is using offsets of int32). And from the output you show, it seems to be using a normal StringArray.
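For rough intuition (a back-of-envelope sketch, assuming an average postcode length of 7 bytes, which is not from the report):

import pyarrow as pa

print(pa.string())        # string: character data addressed by int32 offsets
print(pa.large_string())  # large_string: character data addressed by int64 offsets

n_rows = 1_785_499_246    # number of indices reported in this issue
avg_len = 7               # assumed average postcode length in bytes
print(n_rows * avg_len)   # ~12.5 billion bytes of character data after decoding
print(2**31 - 1)          # 2_147_483_647: max cumulative bytes with int32 offsets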
If that guess is correct, a possible workaround for now could be to first chunk the dictionary array postcode_dict before decoding it.
postcode_dict_chunked = pa.chunked_array([postcode_dict.slice(i, 100_000) for i in range(0, len(postcode_dict), 100_000)])
Now, we don't have a "dictionary_decode" method on a ChunkedArray, so we have to do this manually with a take:
postcode_indices_chunked = pa.chunked_array([chunk.indices for chunk in postcode_dict_chunked.chunks])
postcode_decoded = postcode_dict.dictionary.take(postcode_indices_chunked)
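A toy-scale sketch of that pattern (a chunk size of 2 stands in for 100_000; the names and data here are illustrative, not from the issue):

import pyarrow as pa
import pyarrow.compute as pc

dict_arr = pa.DictionaryArray.from_arrays(
    pa.array([0, 1, 0, 2, 1, 2], type=pa.int32()),
    pa.array(["AB1 0AA", "ZE3 9JZ", "ZE3 9XP"]),
)
chunk_size = 2
chunks = [dict_arr.slice(i, chunk_size) for i in range(0, len(dict_arr), chunk_size)]
indices = pa.chunked_array([c.indices for c in chunks])
decoded = pc.take(dict_arr.dictionary, indices)  # returns a ChunkedArray
print(decoded.to_pylist())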
Thank you, I'll give that a try. Initially I used a DuckDB join as a workaround, which worked fine, but now that I've found it decodes correctly with to_pandas, I'll use that as a convenient workaround. The reason it's a NumPy array is that the prior step is a spatial join, so I used GeoPandas for that part of the processing.
So since the dictionary_decode method is actually just implemented as a "take" with the indices (what I showed above), it's this operation that is doing the wrong thing. We have various other issues related to cases where we should probably automatically upcast to large string (e.g. when concatenating arrays: https://github.com/apache/arrow/issues/33049, https://github.com/apache/arrow/issues/23539, https://github.com/apache/arrow/issues/25822, and others).
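A tiny illustration of that equivalence, on a small array where both paths work correctly:

import pyarrow as pa

arr = pa.array(["a", "b", "a"]).dictionary_encode()
# dictionary_decode is effectively a take of the dictionary values by the indices
assert arr.dictionary_decode().equals(arr.dictionary.take(arr.indices))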
The easier workaround is probably to cast to large_string first (then you don't need the manual chunking):
postcode_dict_large = postcode_dict.cast(pa.dictionary(pa.int32(), pa.large_string()))
postcode_dict_large.dictionary_decode()
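On a small array, the same cast can be checked end to end (a sketch with made-up data, not from the thread):

import pyarrow as pa

small = pa.array(["x", "y", "x"]).dictionary_encode()
large = small.cast(pa.dictionary(pa.int32(), pa.large_string()))
print(large.dictionary_decode().type)  # large_string, i.e. int64 offsets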
Thank you @jorisvandenbossche. Simply specifying type=pa.large_string() on the values array in the DictionaryArray constructor solved the problem. Noted for future applications. pcds_id is already np.int32.
postcode_dict = pa.DictionaryArray.from_arrays(pa.array(pcds_id), pa.array(pcds, type=pa.large_string()))
worked fine, and decoding now returns a LargeStringArray:
postcode_dict.dictionary_decode()
<pyarrow.lib.LargeStringArray object at 0x7f3ca9aea0e0>
[
"AB1 0AA",
"AB1 0AA",
"AB1 0AA",
"AB1 0AA",
"AB1 0AA",
"AB1 0AA",
"AB1 0AA",
"AB1 0AA",
"AB1 0AA",
"AB1 0AA",
...
"ZE3 9JZ",
"ZE3 9JZ",
"ZE3 9XP",
"ZE3 9XP",
"ZE3 9XP",
"ZE3 9XP",
"ZE3 9XP",
"ZE3 9XP",
"ZE3 9XP",
"ZE3 9XP"
]
Describe the bug, including details regarding any error messages, version, and platform.
I have a NumPy array of strings containing UK postcodes (2,603,678 entries) and another NumPy array of indices into it (1,785,499,246 entries).
In the output file I want to replace the indices with the strings, so I created a DictionaryArray from them as follows:
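The original snippet is missing here; presumably it matched the corrected call shown earlier in the thread, without the type= override:

postcode_dict = pa.DictionaryArray.from_arrays(pa.array(pcds_id), pa.array(pcds))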
Where pcds_id contains the indices and pcds contains the postcode strings.
A UK postcode has the format A[A]N[N|A] NAA (A = alpha, N = numeric, | = or, [] = optional), so it varies between 6 and 8 characters in length.
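For illustration only (not part of the original report), a rough regex for that shape:

import re

# rough check of the A[A]N[N|A] NAA shape; not a full UK postcode validator
POSTCODE_RE = re.compile(r"^[A-Z]{1,2}\d[\dA-Z]? \d[A-Z]{2}$")
assert POSTCODE_RE.match("AB1 0AA")
assert POSTCODE_RE.match("ZE3 9XP")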
The dictionary creation works fine.
However, applying dictionary_decode to this array results in the strings becoming mangled.
However, if I convert the array to pandas, it decodes correctly.
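Presumably this works because the conversion maps the DictionaryArray to a pandas Categorical, so the decoded strings are never packed into a single int32-offset buffer. A small self-contained sketch of that path:

import pyarrow as pa

arr = pa.array(["AB1 0AA", "ZE3 9XP", "AB1 0AA"]).dictionary_encode()
s = arr.to_pandas()
print(s.dtype)  # category: codes + categories, no contiguous string buffer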
Source arrays in NPZ format inside a zip file:
postcode_dict.zip
DictionaryArray in parquet inside a zip file on Google Drive (too large for Github):
postcode_dict_parquet.zip
I'm using a RAPIDS Docker image from NVIDIA NGC (as there are GPU dependencies, cuGraph, in the workflow), and the PyArrow version is 10.0.1.
Component(s)
Python
Data Source Attribution & License
Data is a spatial weights matrix (a crosstab of graph distances) built from the UK Office for National Statistics (ONS) Postcode Directory, February 2023.