Bodo-inc / Bodo-Pandas-Collaboration

Shared repo used to track Pandas issues noted by Bodo.

Support reading lists of Parquet files with `read_parquet` #7

Open ehariri opened 2 years ago

ehariri commented 2 years ago

Pandas should support reading a list of Parquet files that share the same schema. Currently, the path argument of read_parquet in Pandas must point to either a directory or a single file. However, users may want to read a subset of the files in a single directory, or files from different directories. This is a common use case that other systems such as Bodo, Spark, and Dask all support.

Assuming ~/path/to/pqs contains part1.pq and part2.pq with the same schema, we wish to do

pd.read_parquet(['~/path/to/pqs/part1.pq', '~/path/to/pqs/part2.pq'])

rather than read the files individually.

mroeschke commented 2 years ago

IIRC users have previously raised issues about generally "passing a list of files in the read_* functions", and those requests were rejected in favor of just looping over the files individually.

datapythonista commented 2 years ago

I found this issue, which seems related: https://github.com/pandas-dev/pandas/issues/26388. It seems we already support opening whole directories of Parquet files. I also see this issue, which may be the one you're referring to (not sure if there are others): https://github.com/pandas-dev/pandas/issues/46039

jorisvandenbossche commented 2 years ago

Pyarrow already supports reading a list of files (and I suppose fastparquet does as well, but I didn't check). So for Parquet it is a matter of how we pass the path argument through (but we have a bunch of logic in _get_path_or_handle / get_handle that would need to be updated to also handle a list of paths).

ehsantn commented 2 years ago

Are there any reasons for not doing this? It makes life easier for users, and it looks like most tools support it.