Open ehariri opened 2 years ago
IIRC, users previously raised issues about generally "passing a list of files in the read_* functions", and those requests were rejected in favor of just looping over the files individually.
I found this issue that seems related: https://github.com/pandas-dev/pandas/issues/26388. It seems we already support opening whole directories of parquet files. I also see this issue, which may be the one you're referring to (I'm not sure whether there are others): https://github.com/pandas-dev/pandas/issues/46039
Pyarrow already supports reading a list of files (and I suppose fastparquet does as well, but I didn't check).
So for Parquet it is a matter of how we pass through the `path` argument (but we have a bunch of logic in `_get_path_or_handle` / `get_handle` which would need to be updated to also handle a list of paths).
Are there any reasons for not doing this? It makes life easier for users, and it looks like most tools support it.
Pandas should support reading lists of parquet files that share the same schema. Currently, the `path` argument of `read_parquet` in Pandas must point to either a directory or a single file. However, users may want to read a subset of the files in a single directory, or files from different directories. This is a common use case that other systems like Bodo, Spark, Dask, ... all support.

Assuming `~/path/to/pqs` contains `part1.pq` and `part2.pq` with the same schema, we wish to read both with a single `read_parquet` call that takes the list of paths, rather than read the files individually.