apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.5k stars 3.53k forks source link

[Python ] Support array objects (beyond numpy) in python->arrow conversion #38675

Open genesis-jamin opened 11 months ago

genesis-jamin commented 11 months ago

Describe the usage question you have. Please include as many useful details as possible.

Hi, I'm running into an issue while using Ray to write a nested dictionary of tensors to parquet. I'm posting here because from the stacktrace, it seems like the error is coming from pyarrow.

Minimal reproducible example:

import ray
import torch
ds_test = ray.data.from_items([{"nested": {"tensor_a": torch.zeros(5), "tensor_b": torch.zeros(5)}}])
ds_test.write_parquet("local://test_parquet")

The error:

pyarrow.lib.ArrowInvalid: ('Could not convert tensor([0., 0., 0., 0., 0.]) with type Tensor: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column nested with type object')

I'm not familiar with parquet so this might just be a usage error. Is this type of nested datastructure supported? Currently I get around it by flattening and unflattening my dictionary, but I'm wondering if there's a better way to go about it.

Component(s)

Parquet

AlenkaF commented 11 months ago

Hi @genesis-jamin , thank you for opening up an issue!

Looking at the Ray docs I see a note on https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.write_parquet.html saying

If pyarrow can’t represent your data, this method errors.

So this is happening in your case as there is no option currently to consume PyTorch tensors in PyArrow. If I understand correctly, Ray is going through pandas (and pandas is using pyarrow) to write the dataset to parquet file and because torch tensors are not recognised by pyarrow you get an error.

One option would be to use torch tensor as a NumPy ndarray (only a view, if I understand correctly https://pytorch.org/docs/stable/generated/torch.Tensor.numpy.html)

In [1]: import ray
   ...: import torch
   ...: ds_test = ray.data.from_items([{"nested": {"tensor_a": torch.zeros(5).numpy(), "tensor_b": torch.zeros(5).numpy()}}])
2023-11-13 12:48:11,823 INFO worker.py:1673 -- Started a local Ray instance.

In [2]: ds_test.write_parquet("test_parquet")
2023-11-13 12:48:15,472 INFO streaming_executor.py:104 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[Write]
...

In [3]: import pyarrow as pa

In [4]: pa.parquet.read_table("test_parquet").to_pandas()
Out[4]: 
                                              nested
0  {'tensor_a': [0.0, 0.0, 0.0, 0.0, 0.0], 'tenso...
1  {'tensor_a': [0.0, 0.0, 0.0, 0.0, 0.0], 'tenso...
2  {'tensor_a': [0.0, 0.0, 0.0, 0.0, 0.0], 'tenso...

It would be nice to be able to convert these kind of data to numpy arrays (or even fixed shape tensor in the nested case) on our side, what do you think @jorisvandenbossche ?

jorisvandenbossche commented 11 months ago

I think it would probably make sense to support any "array-like" (defined by some protocol, eg __array_interface__ or __dlpack__) instead of only numpy arrays.

At the moment this fails in the TypeInferrer for python objects. We have the following for sequence like objects:

https://github.com/apache/arrow/blob/160d45c251b5041b9688ff68c3a7c5e091a50989/python/pyarrow/src/arrow/python/inference.cc#L400-L409

So we currently support sets, lists, tuples and ndarrays (PyArray_Check is from the numpy C API to check for numpy ndarrays).

jorisvandenbossche commented 11 months ago

And note that this has not directly to do with being nested, also for a flat list of pytorch tensors, you will get the same issue:

In [3]: pa.array([torch.zeros(5), torch.zeros(5)])
...
ArrowInvalid: Could not convert tensor([0., 0., 0., 0., 0.]) with type Tensor: did not recognize Python value type when inferring an Arrow data type

(but nesting in a dictionary makes it certainly harder to work around this, for the non-nested version above you can create a ListArray or FixedShapeTensorArray from a single dim+1 torch tensor)

genesis-jamin commented 11 months ago

Thanks @AlenkaF and @jorisvandenbossche ! Switching to Numpy fixes the issue for me (and I realize that the "nested" aspect was a red herring)