apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] Allow serializing arbitrary Python objects to parquet #31267

Open asfimport opened 2 years ago

asfimport commented 2 years ago

I'm trying to serialize a pandas DataFrame containing custom objects to parquet. Here is some example code:


import pandas as pd
import pyarrow as pa

class Foo: 
    pass

df = pd.DataFrame({"a": [Foo(), Foo(), Foo()], "b": [1, 2, 3]})
table = pa.Table.from_pandas(df)

Gives me:


Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 1782, in pyarrow.lib.Table.from_pandas
  File "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 594, in dataframe_to_arrays
    arrays = [convert_column(c, f)
  File "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 594, in <listcomp>
    arrays = [convert_column(c, f)
  File "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 581, in convert_column
    raise e
  File "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 575, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ('Could not convert <__main__.Foo object at 0x7fc23e38bfd0> with type Foo: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column a with type object')

Now, I realise that there's this disclaimer about arbitrary object serialization: https://arrow.apache.org/docs/python/ipc.html#arbitrary-object-serialization. However, it isn't clear how this applies to parquet. In my case, I want a well-formed parquet file with binary blobs in one column that can be deserialized back to my class, but that can otherwise be read by general parquet tools without failing. Using pickle alone doesn't solve this use case, since other languages like R may not be able to read the pickled data.

Alternatively, if there is a well-defined protocol for telling pyarrow how to translate a given type to and from arrow types, I would be happy to use that instead.

 

Reporter: Michael Milton

Note: This issue was originally created as ARROW-15826. Please see the migration documentation for further details.

asfimport commented 2 years ago

Will Jones / @wjones127: Have you considered using a binary column? That should save to parquet just fine.


import pandas as pd
import pyarrow as pa
import pickle

class Foo: 
    pass

df = pd.DataFrame({"a": [pickle.dumps(Foo()) for _ in range(3)], "b": [1, 2, 3]})
table = pa.Table.from_pandas(df)
table
# pyarrow.Table
# a: binary
# b: int64
# ----
# a: [[80049517000000000000008C085F5F6D61696E5F5F948C03466F6F9493942981942E,80049517000000000000008C085F5F6D61696E5F5F948C03466F6F9493942981942E,80049517000000000000008C085F5F6D61696E5F5F948C03466F6F9493942981942E]]
# b: [[1,2,3]]
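And to confirm the blobs survive a round trip through a parquet file, a quick sketch (the file name is arbitrary, and this assumes the table from the snippet above):


import pickle
import pyarrow.parquet as pq

# Write the table with the pickled binary column out to parquet, then
# read it back and unpickle the blobs into Foo instances again.
pq.write_table(table, "data.parquet")
restored = pq.read_table("data.parquet")
objects = [pickle.loads(b) for b in restored.column("a").to_pylist()]
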
asfimport commented 2 years ago

Michael Milton: This approach would be fine; I just wonder whether I can register some kind of automatic pickle serializer that will serialize/deserialize this for me.

asfimport commented 2 years ago

Will Jones / @wjones127: Yeah, there's no hook like that in Table.to_pandas() or Table.from_pandas() for that kind of custom conversion. It should be pretty straightforward to write a wrapping function that does it, though.
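For example, a minimal sketch of such a wrapper pair, assuming pickle is acceptable (the helper names here are made up, not pyarrow API):


import pickle
import pandas as pd
import pyarrow as pa

def from_pandas_pickling(df: pd.DataFrame) -> tuple[pa.Table, list]:
    # Pickle every object-dtype column so it lands in Arrow as binary.
    # Naive: string columns are also object dtype and would get pickled.
    converted = df.copy()
    pickled_cols = [c for c in converted.columns if converted[c].dtype == object]
    for c in pickled_cols:
        converted[c] = converted[c].map(pickle.dumps)
    return pa.Table.from_pandas(converted), pickled_cols

def to_pandas_unpickling(table: pa.Table, pickled_cols) -> pd.DataFrame:
    # Reverse the conversion on the way back out.
    df = table.to_pandas()
    for c in pickled_cols:
        df[c] = df[c].map(pickle.loads)
    return df
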

asfimport commented 2 years ago

Michael Milton: So the pa.SerializationContext() won't work with Table.from_pandas? If not, can I make this into a feature request for that feature?

asfimport commented 2 years ago

Will Jones / @wjones127:

> So the pa.SerializationContext() won't work with Table.from_pandas?

That feature is deprecated, since pickle now works just as well.

> If not, can I make this into a feature request for that feature?

Could you describe the feature you have in mind? A mock code example showing how it would work would be helpful. Are you suggesting that Table.from_pandas should automatically pickle columns of dtype object?

asfimport commented 2 years ago

Will Jones / @wjones127: As an aside, this discussion on GitHub might provide some helpful context: https://github.com/apache/arrow/issues/11239

asfimport commented 2 years ago

Michael Milton: My motivation goes something like this: I have a DataFrame that contains a mix of primitive values and complex types. I want to serialize this in a way that can be read by any other language, and therefore pickle does not suffice. Since it's a DataFrame, and most of the data is primitive, it seems that the Arrow formats, including parquet, would be a good fit.

So the dilemma is: how can we serialize the complex types that are not native to Arrow? As you suggested, we could pickle all object Series by default, but in my mind this still ties us to Python too much; if I open the file in R, that data will be inaccessible. My instinct is that we should have a customizable serializer for custom types that lets us choose the Arrow representation of each type. If I have a simple data class, I would want it represented as the best native type. Does parquet support nested Maps? If so, I would want to be able to tell the serializer to first convert my dataclass to a dict and then convert that to a parquet Map; if nested maps don't work, JSON or BSON is fine. However, if I have another class that only makes sense as a binary blob, like an image, then it would make more sense to serialize it as such.

Since you want some mock code, it would look a bit like this:


import dataclasses
import pandas as pd
import pyarrow as pa

@dataclasses.dataclass
class MyClass:
    a: str
    b: int
    c: bool

# Hypothetical registration hook; this API does not exist in pyarrow
pa.register_serializer(MyClass, lambda instance: dataclasses.asdict(instance))

df = pd.DataFrame({
    "a": [MyClass("a", 1, True), MyClass("b", 2, True), MyClass("c", 3, False)], 
    "b": [1, 2, 3]
})
table = pa.Table.from_pandas(df)
asfimport commented 2 years ago

Will Jones / @wjones127:

> I want to serialize this in a way that can be read by any other language, and therefore pickle does not suffice.

So maybe I misunderstood earlier what you meant by serialization. My initial impression was that you cared about saving Python object instances and didn't care about portability. For example, some users have wanted to save a requests.Response object in a column as part of web-scraped data; that's a case where you have to choose between pickling and keeping all the data, or converting to a format that other languages could read. But it sounds like you are designing the data structure yourself?

Arrow and Parquet support nested types, including struct, list, and map columns. So if what you really care about is saving nested data, that's already possible. With the dataclass example you gave, all you have to do is convert the classes to dict; PyArrow can automatically convert those into nested Arrow types. And any parquet reader will be able to understand it.


from dataclasses import dataclass, asdict
import pandas as pd
import pyarrow as pa

@dataclass
class MyClass:
    a: str
    b: int
    c: bool

df = pd.DataFrame({
    "a": [MyClass("a", 1, True), MyClass("b", 2, True), MyClass("c", 3, False)], 
    "b": [1, 2, 3]
})

# PyArrow already knows how to convert lists and dicts into nested types
df['a'] = [asdict(x) for x in df['a']]

pa.Table.from_pandas(df)
# pyarrow.Table
# a: struct<a: string, b: int64, c: bool>
#   child 0, a: string
#   child 1, b: int64
#   child 2, c: bool
# b: int64
# ----
# a: [
#   -- is_valid: all not null
#   -- child 0 type: string
#     [
#       "a",
#       "b",
#       "c"
#     ]
#   -- child 1 type: int64
#     [
#       1,
#       2,
#       3
#     ]
#   -- child 2 type: bool
#     [
#       true,
#       true,
#       false
#     ]]
# b: [[1,2,3]]
asfimport commented 2 years ago

Michael Milton: Well, I want my approach to be totally compatible with classes I don't control, which is why I want to register serializers rather than do each conversion manually. For example, I could check whether a given object has a __dict__ and, if so, convert it to a dict, and otherwise pickle it. requests.Response is exactly the kind of class I would like to be able to support.
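A rough sketch of that dispatch, assuming a naive __dict__ check (the helper names are made up; nothing like this exists in pyarrow today):


import pickle
import pandas as pd
import pyarrow as pa

def serialize_value(value):
    # Hypothetical dispatch: objects carrying a __dict__ become plain
    # dicts, which pyarrow infers as struct columns; everything else
    # falls back to a pickled binary blob.
    if hasattr(value, "__dict__"):
        return vars(value)
    return pickle.dumps(value)

def to_table_with_dispatch(df: pd.DataFrame) -> pa.Table:
    # Run the dispatch over every object-dtype column before handing
    # the frame to pyarrow. Note this is naive: plain strings also have
    # object dtype in pandas and would end up pickled here.
    converted = df.copy()
    for col in converted.columns:
        if converted[col].dtype == object:
            converted[col] = converted[col].map(serialize_value)
    return pa.Table.from_pandas(converted)
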