Open asfimport opened 4 years ago
Joris Van den Bossche / @jorisvandenbossche:
[~tazimmerman]
thanks for the report!
The underlying bug that caused the metadata to be lost when doing a column selection was recently fixed (ARROW-8802).
So at least, when you include the index columns in your column selection, they will now be properly set as the index in the conversion to pandas (on master, to be released in 1.0):
In [55]: df = pd.DataFrame({"col1": range(9), "col2": np.arange(0, 0.9, 0.1)}, index=pd.MultiIndex.from_product([['A', 'B', 'C'], [1, 2, 3]], names=["level1", "level2"]))
In [56]: df
Out[56]:
               col1  col2
level1 level2
A      1          0   0.0
       2          1   0.1
       3          2   0.2
B      1          3   0.3
       2          4   0.4
       3          5   0.5
C      1          6   0.6
       2          7   0.7
       3          8   0.8
In [57]: tbl = pa.Table.from_pandas(df)
In [58]: pq.write_table(tbl, "/tmp/df.parquet")
    ...: data = ds.dataset("/tmp/df.parquet")
In [63]: data.to_table(filter=ds.field("level1") == "C", columns=["level1", "level2", "col1"]).to_pandas()
Out[63]:
               col1
level1 level2
C      1          6
       2          7
       3          8
However, when not including those columns in your selection, you still don't get the index:
In [64]: data.to_table(filter=ds.field("level1") == "C", columns=["col1"]).to_pandas()
Out[64]:
   col1
0     6
1     7
2     8
But since the user explicitly lists the columns to select in this case, I am not fully sure whether we should automatically include the index columns recorded in the pandas metadata or not.
In the parquet.read_table functionality, we actually do this (triggered with a use_pandas_metadata keyword):
In [72]: pq.read_table("/tmp/df.parquet", columns=["col1"]).to_pandas()
Out[72]:
   col1
0     0
1     1
2     2
3     3
4     4
5     5
6     6
7     7
8     8
In [73]: pq.read_table("/tmp/df.parquet", columns=["col1"], use_pandas_metadata=True).to_pandas()
Out[73]:
               col1
level1 level2
A      1          0
       2          1
       3          2
B      1          3
       2          4
       3          5
C      1          6
       2          7
       3          8
So there, pyarrow will automatically add all index columns to the list of columns to select, if pandas metadata are available.
In principle this is a feature that we could also add to the Dataset.to_table API. It shouldn't be too hard, since the actual implementation already exists in the Python parquet.py code and can be reused; we might only need to do it in multiple places. But I am not fully sure whether we should add it.
Troy Zimmerman: @jorisvandenbossche thank you for the quick response!
Presumably the dataset API uses the same Parquet code under the hood as in your example, so I would vote for consistency: add the use_pandas_metadata argument to Dataset.to_table() and pass it through to the Parquet reader (if that makes sense). While doing it "auto-magically" is convenient, I'm personally a fan of the explicit-over-implicit rule.
I'm not sure if this is a missing feature, or just undocumented, or perhaps not even something I should expect to work.
Let's start with a multi-index dataframe.
This of course works as expected, so let's write the table to disk and read it back with a dataset. The filter also works as expected, and the dataframe is reconstructed properly. Let's do it again, but this time with a column selection.
Hmm, not quite what I was thinking, but excluding the indices from the columns seems like a dumb move on my part, so let's try again, and this time include all columns to be safe.
It seems that when I specify any or all columns, the schema metadata is lost along the way, so to_pandas doesn't reconstruct the dataframe to match the original.
Here are the relevant versions:
Reporter: Troy Zimmerman
Note: This issue was originally created as ARROW-9302. Please see the migration documentation for further details.