apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] Specifying columns in a dataset drops the index (pandas) metadata. #25391

Open asfimport opened 4 years ago

asfimport commented 4 years ago

I'm not sure if this is a missing feature, or just undocumented, or perhaps not even something I should expect to work.

Let's start with a multi-index dataframe.


>>> import pandas as pd
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq
>>>
>>> df = pd.DataFrame(
...     {
...         "data": [0.0, 1.1, 1.2, 2.1, 2.2, 2.3],
...         "id": ["a1", "b2", "b3", "c4", "c5", "c6"],
...         "when": pd.Timestamp("2020-05-05 08:30:01+00:00"),
...     },
...     index=pd.MultiIndex.from_arrays(
...         [["a", "b", "b", "c", "c", "c"], [1, 2, 3, 4, 5, 6]],
...         names=["letter", "number"],
...     ),
... )
>>> df
               data  id                      when
letter number
a      1        0.0  a1 2020-05-05 08:30:01+00:00
b      2        1.1  b2 2020-05-05 08:30:01+00:00
       3        1.2  b3 2020-05-05 08:30:01+00:00
c      4        2.1  c4 2020-05-05 08:30:01+00:00
       5        2.2  c5 2020-05-05 08:30:01+00:00
       6        2.3  c6 2020-05-05 08:30:01+00:00

>>> tbl = pa.Table.from_pandas(df)
>>> tbl
pyarrow.Table
data: double
id: string
when: timestamp[ns, tz=+00:00]
letter: string
number: int64
>>> tbl.schema
data: double
id: string
when: timestamp[ns, tz=+00:00]
letter: string
number: int64
-- schema metadata --
pandas: '{"index_columns": ["letter", "number"], "column_indexes": [{"nam' + 783

This of course works as expected, so let's write the table to disk, and read it with a dataset.


>>> pq.write_table(tbl, "/tmp/df.parquet")
>>> data = ds.dataset("/tmp/df.parquet")
>>> data.to_table(filter=ds.field("letter") == "c").to_pandas()
               data  id                      when
letter number
c      4        2.1  c4 2020-05-05 08:30:01+00:00
       5        2.2  c5 2020-05-05 08:30:01+00:00
       6        2.3  c6 2020-05-05 08:30:01+00:00

The filter also works as expected, and the dataframe is reconstructed properly. Let's do it again, but this time with a column selection.


>>> data.to_table(filter=ds.field("letter") == "c", columns=["data", "id"]).to_pandas()
   data  id
0   2.1  c4
1   2.2  c5
2   2.3  c6

Hmm, not quite what I was thinking, but excluding the indices from the columns seems like a dumb move on my part, so let's try again, and this time include all columns to be safe.


>>> tbl = data.to_table(filter=ds.field("letter") == "c", columns=["letter", "number", "data", "id", "when"])
>>> tbl.to_pandas()
  letter  number  data  id                      when
0      c       4   2.1  c4 2020-05-05 08:30:01+00:00
1      c       5   2.2  c5 2020-05-05 08:30:01+00:00
2      c       6   2.3  c6 2020-05-05 08:30:01+00:00
>>> tbl
pyarrow.Table
letter: string
number: int64
data: double
id: string
when: timestamp[us, tz=UTC]
>>> tbl.schema
letter: string
  -- field metadata --
  PARQUET:field_id: '4'
number: int64
  -- field metadata --
  PARQUET:field_id: '5'
data: double
  -- field metadata --
  PARQUET:field_id: '1'
id: string
  -- field metadata --
  PARQUET:field_id: '2'
when: timestamp[us, tz=UTC]
  -- field metadata --
  PARQUET:field_id: '3'

It seems that when I specify any or all columns, the schema metadata is lost along the way, so to_pandas doesn't reconstruct the dataframe to match the original.
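
For what it's worth, the index can be rebuilt by hand in the meantime; a small workaround sketch (assuming the index columns were included in the selection, as above):


>>> tbl.to_pandas().set_index(["letter", "number"])
               data  id                      when
letter number
c      4        2.1  c4 2020-05-05 08:30:01+00:00
       5        2.2  c5 2020-05-05 08:30:01+00:00
       6        2.3  c6 2020-05-05 08:30:01+00:00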

Here are my relevant versions:

Reporter: Troy Zimmerman

Note: This issue was originally created as ARROW-9302. Please see the migration documentation for further details.

asfimport commented 4 years ago

Joris Van den Bossche / @jorisvandenbossche: [~tazimmerman] thanks for the report!

The actual cause of the metadata being lost when doing a column selection was recently fixed (ARROW-8802).

So at least, when the index columns are included in your column selection, they will now be properly set as the index in the conversion to pandas (on master, to be released in 1.0):


In [54]: import numpy as np; import pandas as pd; import pyarrow as pa; import pyarrow.dataset as ds; import pyarrow.parquet as pq

In [55]: df = pd.DataFrame({"col1": range(9), "col2": np.arange(0, 0.9, 0.1)}, index=pd.MultiIndex.from_product([['A', 'B', 'C'], [1, 2, 3]], names=["level1", "level2"]))

In [56]: df                                                                                                                                                                                                        
Out[56]: 
               col1  col2
level1 level2            
A      1          0   0.0
       2          1   0.1
       3          2   0.2
B      1          3   0.3
       2          4   0.4
       3          5   0.5
C      1          6   0.6
       2          7   0.7
       3          8   0.8

In [57]: tbl = pa.Table.from_pandas(df)

In [58]: pq.write_table(tbl, "/tmp/df.parquet")
    ...: data = ds.dataset("/tmp/df.parquet")

In [63]: data.to_table(filter=ds.field("level1") == "C", columns=["level1", "level2", "col1"]).to_pandas()                                                                                                         
Out[63]: 
               col1
level1 level2      
C      1          6
       2          7
       3          8

However, when not including those columns in your selection, you still don't get the index:


In [64]: data.to_table(filter=ds.field("level1") == "C", columns=["col1"]).to_pandas()                                                                                                                             
Out[64]: 
   col1
0     6
1     7
2     8

But since the user explicitly lists the columns to select in this case, I am not fully sure whether we should automatically include the pandas index columns or not.
In the parquet.read_table functionality, we actually do this (triggered with the use_pandas_metadata keyword):


In [72]: pq.read_table("/tmp/df.parquet", columns=["col1"]).to_pandas()                                                                                                                                            
Out[72]: 
   col1
0     0
1     1
2     2
3     3
4     4
5     5
6     6
7     7
8     8

In [73]: pq.read_table("/tmp/df.parquet", columns=["col1"], use_pandas_metadata=True).to_pandas()                                                                                                                  
Out[73]: 
               col1
level1 level2      
A      1          0
       2          1
       3          2
B      1          3
       2          4
       3          5
C      1          6
       2          7
       3          8

With that keyword, pyarrow automatically adds all index columns to the list of columns to select, if pandas metadata are available.
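
Under the hood, that roughly amounts to reading the pandas metadata from the file schema and extending the requested columns; a minimal sketch of the idea (not the exact parquet.py implementation):


In [74]: import json

In [75]: schema = pq.read_schema("/tmp/df.parquet")

In [76]: pandas_meta = json.loads(schema.metadata[b"pandas"])

In [77]: # string entries in "index_columns" are stored columns; dict entries
    ...: # describe a RangeIndex, which is not stored as a column
    ...: index_columns = [c for c in pandas_meta["index_columns"] if isinstance(c, str)]

In [78]: columns = ["col1"] + [c for c in index_columns if c != "col1"]

In [79]: pq.read_table("/tmp/df.parquet", columns=columns).to_pandas()  # same result as Out[73]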

In principle this is a feature we could also add to the Dataset.to_table API. It shouldn't be too hard, since the actual implementation already exists in the Python parquet.py code and can be reused; we might only need to do it in multiple places. But I am not fully sure whether we should add it.

asfimport commented 4 years ago

Troy Zimmerman: @jorisvandenbossche thank you for the quick response!

Presumably the dataset API uses the same parquet code under the hood as in your example, so I would vote for consistency and add a use_pandas_metadata argument to Dataset.to_table(), passing it through to the parquet reader (if that makes sense). While doing it "auto-magically" is convenient, I'm personally a fan of the explicit-over-implicit rule.
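
In the meantime, the same effect can be approximated by hand with the dataset API, pulling the index columns out of the pandas metadata; a sketch (assuming a pyarrow build with the ARROW-8802 fix, so the dataset schema still carries the pandas metadata):


>>> import json
>>> meta = json.loads(data.schema.metadata[b"pandas"])
>>> index_columns = [c for c in meta["index_columns"] if isinstance(c, str)]
>>> data.to_table(
...     filter=ds.field("letter") == "c",
...     columns=["data", "id"] + index_columns,
... ).to_pandas()
               data  id
letter number
c      4        2.1  c4
       5        2.2  c5
       6        2.3  c6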