[HDFS] Add HDFS support for parquet files stored in HDFS systems

Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust

https://getdaft.io

Apache License 2.0

[HDFS] Add HDFS support for parquet files stored in HDFS systems #2786

Open MisterKloudy opened 3 weeks ago

MisterKloudy commented 3 weeks ago

I have been using pyarrow's pa.hdfs.connect() and pq.ParquetDataset to read the files, and then loading the resulting pyarrow data into Daft. The alternative is to use pandas' read_parquet and then Daft's from_pandas. However, this is extremely slow and often leads to memory-related errors. If Daft could read from HDFS directly, it would be easier for me to reuse the same code across sources and to fall back on Spark less often.
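
For readers who want to see the pyarrow-based workaround described above, here is a minimal sketch. It is not taken from the issue: the helper names split_hdfs_uri and read_hdfs_parquet, the example URI, and the default port 8020 are assumptions made for illustration. It also uses pyarrow.fs.HadoopFileSystem in place of pa.hdfs.connect(), on the assumption that a recent pyarrow release is installed where the legacy call is no longer the recommended entry point. The whole dataset is materialized as one Arrow table before Daft sees it, so this does not avoid the memory pressure mentioned above.

```python
from urllib.parse import urlparse


def split_hdfs_uri(uri, default_port=8020):
    """Split an hdfs:// URI into (host, port, path).

    Hypothetical helper; 8020 is assumed as the default NameNode port.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got {uri!r}")
    return parsed.hostname, parsed.port or default_port, parsed.path


def read_hdfs_parquet(uri):
    """Read a Parquet dataset from HDFS through pyarrow and wrap it in Daft.

    Hypothetical helper. Requires pyarrow, daft, and a working libhdfs
    setup on the machine where it runs.
    """
    # Imported lazily so the URI helper above works without these packages.
    import daft
    import pyarrow.fs as pafs
    import pyarrow.parquet as pq

    host, port, path = split_hdfs_uri(uri)
    hdfs = pafs.HadoopFileSystem(host, port)

    # Read the whole dataset into a single in-memory Arrow table.
    table = pq.ParquetDataset(path, filesystem=hdfs).read()

    # Hand the Arrow table to Daft.
    return daft.from_arrow(table)
```

Usage would look like read_hdfs_parquet("hdfs://namenode:8020/data/events"), where the host and path are placeholders for a real cluster.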

samster25 commented 3 weeks ago

@chuanlei-coding is actually taking a stab at this!

https://github.com/Eventual-Inc/Daft/pull/2787

MisterKloudy commented 3 weeks ago

Super cool!! Thanks for working on this!