Open genesis-jamin opened 11 months ago
Hi @genesis-jamin , thank you for opening up an issue!
Looking at the Ray docs I see a note on https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.write_parquet.html saying
If pyarrow can’t represent your data, this method errors.
So this is happening in your case as there is no option currently to consume PyTorch tensors in PyArrow. If I understand correctly, Ray is going through pandas (and pandas is using pyarrow) to write the dataset to parquet file and because torch tensors are not recognised by pyarrow you get an error.
One option would be to use torch tensor as a NumPy ndarray (only a view, if I understand correctly https://pytorch.org/docs/stable/generated/torch.Tensor.numpy.html)
In [1]: import ray
...: import torch
...: ds_test = ray.data.from_items([{"nested": {"tensor_a": torch.zeros(5).numpy(), "tensor_b": torch.zeros(5).numpy()}}])
2023-11-13 12:48:11,823 INFO worker.py:1673 -- Started a local Ray instance.
In [2]: ds_test.write_parquet("test_parquet")
2023-11-13 12:48:15,472 INFO streaming_executor.py:104 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[Write]
...
In [3]: import pyarrow as pa
In [4]: pa.parquet.read_table("test_parquet").to_pandas()
Out[4]:
nested
0 {'tensor_a': [0.0, 0.0, 0.0, 0.0, 0.0], 'tenso...
1 {'tensor_a': [0.0, 0.0, 0.0, 0.0, 0.0], 'tenso...
2 {'tensor_a': [0.0, 0.0, 0.0, 0.0, 0.0], 'tenso...
It would be nice to be able to convert these kind of data to numpy arrays (or even fixed shape tensor in the nested case) on our side, what do you think @jorisvandenbossche ?
I think it would probably make sense to support any "array-like" (defined by some protocol, eg __array_interface__
or __dlpack__
) instead of only numpy arrays.
At the moment this fails in the TypeInferrer for python objects. We have the following for sequence like objects:
So we currently support sets, lists, tuples and ndarrays (PyArray_Check
is from the numpy C API to check for numpy ndarrays).
And note that this has not directly to do with being nested, also for a flat list of pytorch tensors, you will get the same issue:
In [3]: pa.array([torch.zeros(5), torch.zeros(5)])
...
ArrowInvalid: Could not convert tensor([0., 0., 0., 0., 0.]) with type Tensor: did not recognize Python value type when inferring an Arrow data type
(but nesting in a dictionary makes it certainly harder to work around this, for the non-nested version above you can create a ListArray or FixedShapeTensorArray from a single dim+1 torch tensor)
Thanks @AlenkaF and @jorisvandenbossche ! Switching to Numpy fixes the issue for me (and I realize that the "nested" aspect was a red herring)
Describe the usage question you have. Please include as many useful details as possible.
Hi, I'm running into an issue while using Ray to write a nested dictionary of tensors to parquet. I'm posting here because from the stacktrace, it seems like the error is coming from pyarrow.
Minimal reproducible example:
The error:
I'm not familiar with parquet so this might just be a usage error. Is this type of nested datastructure supported? Currently I get around it by flattening and unflattening my dictionary, but I'm wondering if there's a better way to go about it.
Component(s)
Parquet