asfimport opened this issue 2 years ago
Will Jones / @wjones127: Have you considered using a binary column? That should save to parquet just fine.
import pandas as pd
import pyarrow as pa
import pickle

class Foo:
    pass

df = pd.DataFrame({"a": [pickle.dumps(Foo()) for _ in range(3)], "b": [1, 2, 3]})
table = pa.Table.from_pandas(df)
table
# pyarrow.Table
# a: binary
# b: int64
# ----
# a: [[80049517000000000000008C085F5F6D61696E5F5F948C03466F6F9493942981942E,80049517000000000000008C085F5F6D61696E5F5F948C03466F6F9493942981942E,80049517000000000000008C085F5F6D61696E5F5F948C03466F6F9493942981942E]]
# b: [[1,2,3]]
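For completeness, here is a sketch of the Parquet round trip (the file name is arbitrary, and unpickling on read is manual):

import pyarrow.parquet as pq

pq.write_table(table, "pickled_objects.parquet")
restored = pq.read_table("pickled_objects.parquet")
# Each cell comes back as raw bytes; unpickle to recover the objects.
objs = [pickle.loads(b) for b in restored["a"].to_pylist()]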
Michael Milton: This approach would be fine; I just wonder if I can register some kind of automatic pickle serializer that would serialize/deserialize this for me.
Will Jones / @wjones127: Yeah there's no hook like that in Table.to_pandas() or Table.from_pandas() to do that kind of custom conversion. Should be pretty straightforward to write a wrapping function that does that though.
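For instance, a wrapper along these lines (from_pandas_pickling is a made-up name, and the object-dtype check is deliberately crude, since object dtype also covers plain strings):

import pickle
import pandas as pd
import pyarrow as pa

def from_pandas_pickling(df: pd.DataFrame) -> pa.Table:
    # Pickle every object-dtype column up front so Arrow sees plain bytes.
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object:
            out[col] = out[col].map(pickle.dumps)
    return pa.Table.from_pandas(out)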
Michael Milton: So the pa.SerializationContext() won't work with Table.from_pandas? If not, can I make this into a feature request?
Will Jones / @wjones127:
> So the pa.SerializationContext() won't work with Table.from_pandas?
That feature is deprecated, since pickle now works just as well.
> If not, can I make this into a feature request?
Could you describe the feature you have in mind? A mock code example showing how it would work would be helpful. Are you suggesting that Table.from_pandas should automatically pickle columns of dtype object?
Will Jones / @wjones127: As an aside, this discussion on GitHub might provide some helpful context: https://github.com/apache/arrow/issues/11239
Michael Milton: My motivation goes something like this: I have a DataFrame that contains a mix of primitive values and complex types. I want to serialize this in a way that can be read by any other language, and therefore pickle does not suffice. Since it's a DataFrame, and most of the data is primitive, it seems that the Arrow formats, including parquet, would be a good fit.
So now the dilemma is: how can we serialize the complex types that are not native to Arrow? As you suggested, we could by default pickle all object Series, but in my mind this still ties us to Python too much. If I open it in R, it's going to be impossible to access this data. So my instinct is that we should have a customizable serializer for custom types that lets us choose the Arrow representation of a type. If I have a simple data class, I would want it to be represented as the best native type. Does parquet support nested Maps? If so, I would want to be able to tell the serializer that it should first convert my dataclass to a dict, and then convert that to a parquet Map. If nested maps don't work, then JSON or BSON is fine. However, if I have another class that only makes sense as a binary blob, like an image, then it would make more sense to serialize it as such.
Since you want some mock code, it would look a bit like this:
import dataclasses
import pandas as pd
import pyarrow as pa

@dataclasses.dataclass
class MyClass:
    a: str
    b: int
    c: bool

# register_serializer is the hypothetical API being proposed here
pa.register_serializer(MyClass, lambda instance: dataclasses.asdict(instance))

df = pd.DataFrame({
    "a": [MyClass("a", 1, True), MyClass("b", 2, True), MyClass("c", 3, False)],
    "b": [1, 2, 3],
})
table = pa.Table.from_pandas(df)
Will Jones / @wjones127:
> I want to serialize this in a way that can be read by any other language, and therefore pickle does not suffice.
So maybe I misunderstood earlier what you meant by serialization. My initial impression was that you cared about saving Python object instances and didn't care about portability. For example, some users have wanted to save a requests.Response object in a column as part of web-scraped data; that's a case where you have to choose between pickling and keeping all the data, or converting to a format that other languages could read. But it sounds like you are designing the data structure?
Arrow and Parquet support nested types, including struct, list, and map columns. So if what you really care about is saving nested data, that's already possible. With the dataclass example you gave, all you have to do is convert the instances to dicts; PyArrow can automatically convert those into nested Arrow types, and any parquet reader will be able to understand them.
from dataclasses import dataclass, asdict

import pandas as pd
import pyarrow as pa

@dataclass
class MyClass:
    a: str
    b: int
    c: bool

df = pd.DataFrame({
    "a": [MyClass("a", 1, True), MyClass("b", 2, True), MyClass("c", 3, False)],
    "b": [1, 2, 3],
})

# PyArrow already knows how to convert lists and dicts into nested types
df['a'] = [asdict(x) for x in df['a']]
pa.Table.from_pandas(df)
# pyarrow.Table
# a: struct<a: string, b: int64, c: bool>
#   child 0, a: string
#   child 1, b: int64
#   child 2, c: bool
# b: int64
# ----
# a: [
#   -- is_valid: all not null
#   -- child 0 type: string
#     [
#       "a",
#       "b",
#       "c"
#     ]
#   -- child 1 type: int64
#     [
#       1,
#       2,
#       3
#     ]
#   -- child 2 type: bool
#     [
#       true,
#       true,
#       false
#     ]
# ]
# b: [[1,2,3]]
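Reading it back is just as mechanical; to_pylist() returns plain dicts that can be re-hydrated into the dataclass (a sketch continuing the example above):

table = pa.Table.from_pandas(df)
restored = [MyClass(**d) for d in table["a"].to_pylist()]
# restored[0] == MyClass(a='a', b=1, c=True)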
Michael Milton: Well, I want my approach to be totally compatible with classes I don't control, which is why I want to be able to register serializers rather than doing it all manually. For example, I could check whether a given object has a __dict__ and, if so, convert it to a dict; otherwise pickle it. requests.Response is an example of something I would like to be able to support.
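A sketch of that rule (serialize_cell is a made-up name, and the __dict__ check is deliberately crude):

import pickle

def serialize_cell(obj):
    # Portable path: objects exposing __dict__ become plain dicts,
    # which Arrow can store as a struct column readable from any language.
    if hasattr(obj, "__dict__"):
        return vars(obj)
    # Fallback: an opaque pickle blob in a binary column (Python-only).
    return pickle.dumps(obj)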
Michael Milton (original issue description):
I'm trying to serialize a pandas DataFrame containing custom objects to parquet. Here is some example code:
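(The original snippet was not preserved in the migration; the following is a minimal reconstruction, with Foo as a stand-in class.)

import pandas as pd
import pyarrow as pa

class Foo:
    pass

df = pd.DataFrame({"a": [Foo(), Foo(), Foo()], "b": [1, 2, 3]})
table = pa.Table.from_pandas(df)  # raises pyarrow.lib.ArrowInvalid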
Gives me an ArrowInvalid error, since PyArrow does not recognize Foo instances when inferring an Arrow data type.
Now, I realise that there's this disclaimer about arbitrary object serialization: https://arrow.apache.org/docs/python/ipc.html#arbitrary-object-serialization. However, it isn't clear how this applies to parquet. In my case, I want a well-formed parquet file that has binary blobs in one column that can be deserialized to my class, but that can otherwise be read by general parquet tools without failing. Using pickle alone doesn't solve this use case, since other languages like R may not be able to read a pickle file.
Alternatively, if there is a well-defined protocol for telling pyarrow how to translate a given type to and from Arrow types, I would be happy to use that instead.
Reporter: Michael Milton
Note: This issue was originally created as ARROW-15826. Please see the migration documentation for further details.