apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] Get all columns names (or schema) from Feather file, before loading whole Feather file #18335

Open asfimport opened 3 years ago

asfimport commented 3 years ago

Is there a way to get all column names (or schema) from a Feather file before loading the full Feather file?

My Feather files are big (around 100 GB), and the column names differ per analysis, so they can't be hardcoded.


import pyarrow.feather as feather

# Code here to check which columns are in the feather file.
...
my_columns = ...

# Result is pandas.DataFrame
read_df = feather.read_feather('/path/to/file', columns=my_columns)

# Result is pyarrow.Table
read_arrow = feather.read_table('/path/to/file', columns=my_columns)

Reporter: Gert Hulselmans / @ghuls

Note: This issue was originally created as ARROW-10344. Please see the migration documentation for further details.

asfimport commented 3 years ago

Joris Van den Bossche / @jorisvandenbossche: @ghuls good question, this is not really well documented.

This is possible (at least starting with pyarrow 1.0), but not directly with the pyarrow.feather module.

Two options:

1) Since a feather file is basically the IPC serialization format written to a file, you can use the pyarrow.ipc functionality to interact with it (see http://arrow.apache.org/docs/python/ipc.html#writing-and-reading-random-access-files).
Small example:


# writing a small file
import pyarrow as pa
from pyarrow import feather
table = pa.table({'a': [1, 2, 3], 'b': [.1, .2, .3]})
feather.write_feather(table, "data.feather")

In [13]: import pyarrow.ipc

In [14]: reader = pa.ipc.open_file("data.feather")

In [15]: reader
Out[15]: <pyarrow.ipc.RecordBatchFileReader at 0x7fe6d3d51798>

In [16]: reader.schema
Out[16]: 
a: int64
b: double

2) Use the new Datasets API (http://arrow.apache.org/docs/python/dataset.html):


In [17]: import pyarrow.dataset as ds

In [18]: dataset = ds.dataset("data.feather", format="feather")

In [19]: dataset.schema
Out[19]: 
a: int64
b: double

In [20]: dataset.to_table().to_pandas()
Out[20]: 
   a    b
0  1  0.1
1  2  0.2
2  3  0.3

In addition, the datasets API also allows you to filter rows directly with an expression while reading (http://arrow.apache.org/docs/python/dataset.html#filtering-data), and it can read from a collection of (partitioned) files at once.

For both options you need Feather version 2 files (https://ursalabs.org/blog/2020-feather-v2/), so if you have been using Feather for a while (and have version 1 files), it might be worth converting those.

asfimport commented 3 years ago

Gert Hulselmans / @ghuls: Thanks, this looks very helpful.

I was using Feather v2, but had to switch back to v1, due to a metadata bug when writing Feather v2 files: https://issues.apache.org/jira/browse/ARROW-10056

Is this pandas metadata very useful to have in my case? My Feather files contain just one string column (row indices); the rest are columns of int16, int32, or float32 (all other columns have the same type within one Feather file).

For filtering the data, is there an easy way to create a filter for thousands of columns? e.g.: only return rows for which at least one column has a value < 5000?

asfimport commented 3 years ago

Joris Van den Bossche / @jorisvandenbossche:

For filtering the data, is there an easy way to create a filter for thousands of columns? e.g.: only return rows for which at least one column has a value < 5000?

I don't think we provide a direct way, but with some Python utilities you could construct such a filter. E.g.:


In [13]: import pyarrow.dataset as ds

In [14]: import operator

In [15]: import functools

In [16]: expr = functools.reduce(operator.or_, [ds.field(f"col{i}") < 5000 for i in range(5000)])

But I have no idea how that will perform as a filter (I don't think we have really considered such a use case for filtering up to now, so I'm not sure the expression/filtering code is optimized for a filter with that many columns).

asfimport commented 3 years ago

Gert Hulselmans / @ghuls: Thanks. I can still do the filtering afterwards, so that is not a big problem.

Is there also something similar to pyarrow.dataset that allows combining multiple Feather files on one common column (an index, which is in the same order in all Feather files), while the other columns differ? It seems pyarrow.dataset only supports appending rows.


feather1:
index  col1  col2
1
...
n

feather2:
index  col3  col4  col5
1
...
n

feather3:
index  col6  col7  col8
1
...
n

read feather1, 2, 3 as one combined table:
index  col1  col2  col3  col4  col5  col6  col7  col8
1
...
n
asfimport commented 3 years ago

Joris Van den Bossche / @jorisvandenbossche: No, there is no way to concat horizontally with the datasets API.

From your description and the type of data, I am wondering whether Feather is actually a suitable file format. It seems you have big array-like data (numeric, but with labeled dimensions) that is potentially very wide and that you would ideally chunk along both the rows and the columns. For something like that, it might be interesting to look at the zarr format (https://zarr.readthedocs.io/en/stable/index.html); it was originally started by someone working on genome data, but is now also becoming popular in geoscience/climate data that historically uses a lot of NetCDF.

asfimport commented 3 years ago

Gert Hulselmans / @ghuls: We need the final data to be readable from Python and R, so Feather looked like a good choice.

To create the dataset, the data is generated:

asfimport commented 3 years ago

Gert Hulselmans / @ghuls: Could reading the metadata schema for Feather v1 also be supported? For now I am required to keep the files in v1 format due to https://issues.apache.org/jira/browse/ARROW-10056. Or are there plans to fix that soon?

asfimport commented 3 years ago

Joris Van den Bossche / @jorisvandenbossche:

.. or are there plans to fix that soon?

See my new comment on that issue. TLDR I don't think it is solvable in general.

Could reading the metadata schema for Feather v1 also be supported

I think that is technically certainly possible. The C++ Reader interface already exposes a schema() function, but this is not exposed in Python. I suppose this would also be nice to have for V2 in the pyarrow.feather module.

We need the final data to be readable from Python and R, so Feather looked like a good choice.

That's indeed one of the selling points of Feather, and I also didn't find any up-to-date R interface for zarr.
I think it might still be worth looking at other options (given the inherent limitation for V2 mentioned above). I don't have any experience with it myself, but it might also be worth taking a look at TileDB.

If you want to stay with arrow/feather files, one other alternative is to use a "trick" of putting all columns (of the same type) in a FixedSizeList column (the data under the hood is then stored in a contiguous array, which can be easily "viewed" as a 2D array). However, then you can no longer read only a subset of the columns, which might be an important use case.

asfimport commented 3 years ago

al-hadi boublenza: I'm facing the same issue and wondering how to know whether you're dealing with a Feather V1 or a Feather V2 file (using pyarrow)?

asfimport commented 3 years ago

Gert Hulselmans / @ghuls: [~weldingwelding] The first 4 (v1) or 6 (v2) bytes, and the last 4 or 6 bytes, of the Feather file tell you. For example, you can check them with hexdump.


❯ hexdump -C -n 8 feather_version1.feather
00000000  46 45 41 31 00 00 00 00  |FEA1....|
00000008

❯ hexdump -C -n 8 feather_version2.feather
00000000  41 52 52 4f 57 31 00 00  |ARROW1..|
00000008

def feather_v1_or_v2(feather_file):
    with open(feather_file, 'rb') as fh_feather:
        # Read first 4 and last 4 bytes to see if we have a Feather v1 file.
        fh_feather.seek(0, 0)
        feather_v1_magic_bytes_header = fh_feather.read(4)
        fh_feather.seek(-4, 2)
        feather_v1_magic_bytes_footer = fh_feather.read(4)

        if feather_v1_magic_bytes_header == feather_v1_magic_bytes_footer == b'FEA1':
            return 1

        # Read first 6 and last 6 bytes to see if we have a Feather v2 file.
        fh_feather.seek(0, 0)
        feather_v2_magic_bytes_header = fh_feather.read(6)
        fh_feather.seek(-6, 2)
        feather_v2_magic_bytes_footer = fh_feather.read(6)

        if feather_v2_magic_bytes_header == feather_v2_magic_bytes_footer == b'ARROW1':
            return 2

    return None

@jorisvandenbossche Now that https://issues.apache.org/jira/browse/ARROW-10056 is resolved, Feather v1 support is less critical for me, so the IPC and dataset API workarounds are now useful to me. It would still be good to have Feather v1 support and to expose the column names in the feather submodule directly.

asfimport commented 3 years ago

Gert Hulselmans / @ghuls: I figured out a way to get the column names for a feather v1 and v2 file.

For getting the info from a Feather v1 file, I used the flatbuffer file:


# Download FlatBuffer Feather v1 schema.
wget https://github.com/apache/arrow/raw/master/cpp/src/arrow/ipc/feather.fbs

# Create python code for FlatBuffer Feather v1 schema.
flatc --python feather.fbs

import numpy as np
import pyarrow.dataset as ds

# wget https://github.com/apache/arrow/raw/master/cpp/src/arrow/ipc/feather.fbs
# flatc --python feather.fbs
import feather.fbs.CTable as feather_v1_fbs

def is_feather_v1_or_v2(feather_file):
    with open(feather_file, 'rb') as fh_feather:
        # Read first 6 and last 6 bytes to see if we have a Feather v2 file.
        fh_feather.seek(0, 0)
        feather_v2_magic_bytes_header = fh_feather.read(6)
        fh_feather.seek(-6, 2)
        feather_v2_magic_bytes_footer = fh_feather.read(6)

        if feather_v2_magic_bytes_header == feather_v2_magic_bytes_footer == b'ARROW1':
            return 2

        # Read first 4 and last 4 bytes to see if we have a Feather v1 file.
        feather_v1_magic_bytes_header = feather_v2_magic_bytes_header[0:4]
        feather_v1_magic_bytes_footer = feather_v2_magic_bytes_footer[2:]

        if feather_v1_magic_bytes_header == feather_v1_magic_bytes_footer == b'FEA1':
            return 1

    return None

def get_column_names_from_feather(feather_file):
    feather_v1_or_v2 = is_feather_v1_or_v2(feather_file)

    if feather_v1_or_v2 == 1:
        with open(feather_file, 'rb') as fh_feather:
            fh_feather.seek(-8, 2)

            # Get Feather v1 metadata length.
            metadata_length = np.frombuffer(fh_feather.read(4), dtype=np.int32)[0]

            # Read Feather v1 metadata.
            fh_feather.seek(-(metadata_length + 8), 2)
            feather_metadata = feather_v1_fbs.CTable.GetRootAsCTable(bytearray(fh_feather.read(metadata_length)), 0)

            num_columns = feather_metadata.ColumnsLength()

            column_names = [
                feather_metadata.Columns(column_idx).Name().decode('utf-8')
                for column_idx in range(0, num_columns)
            ]
    elif feather_v1_or_v2 == 2:
        feather_v2_dataset = ds.dataset(feather_file, format="feather")
        column_names = feather_v2_dataset.schema.names
    else:
        return None

    return column_names

In [7]: is_feather_v1_or_v2('test/ct_rankings_db_genes_vs_tracks.feather_version1.genes_vs_tracks.rankings.feather')
   ...:
Out[7]: 1

In [8]: is_feather_v1_or_v2('test/ct_rankings_db_genes_vs_tracks.feather_version2.genes_vs_tracks.rankings.feather')
Out[8]: 2

In [9]: get_column_names_from_feather('test/ct_rankings_db_genes_vs_tracks.feather_version1.genes_vs_tracks.rankings.feather')
Out[9]: ['gene1', 'gene2', 'gene3', 'gene4', 'gene5', 'gene6', 'gene7', 'tracks']

In [10]: get_column_names_from_feather('test/ct_rankings_db_genes_vs_tracks.feather_version2.genes_vs_tracks.rankings.feather')
Out[10]: ['gene1', 'gene2', 'gene3', 'gene4', 'gene5', 'gene6', 'gene7', 'tracks']
asfimport commented 3 years ago

Gert Hulselmans / @ghuls: @jorisvandenbossche Are there plans to add reading column information from Feather v1 files before loading them, from Python (and from R; I didn't look at it yet, but we need this support in R too)?

asfimport commented 3 years ago

Gert Hulselmans / @ghuls: Combined the above snippets in a cleaner way: https://github.com/aertslab/create_cisTarget_databases/commit/dcf70e60e915d2dc6850343960e7a7d3d3d56c41

asfimport commented 1 year ago

Gert Hulselmans / @ghuls: @jorisvandenbossche Will it ever be possible to get the schema for a Feather v1 file in pyarrow without needing to load the Feather file first?


import pyarrow.feather as pf

# E.g. a schema option for FeatherDataset.
pf.FeatherDataset(["test.v1feather"]).schema