Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0
1.76k stars 105 forks

[FEAT] Local optimized JSONLines reader #2196

Open samster25 opened 2 weeks ago

samster25 commented 2 weeks ago

We currently have an optimized multithreaded JSON Lines reader for cloud storage, built on Rust async/await and tokio. This makes it one of the fastest ways to read JSON from S3, as seen here.

This reader currently uses the same code path for S3/GCS/Azure and for local files, which makes local reads much slower than they need to be. We noticed the same issue with our Parquet reader and added a "local" path that checks whether the file is local and, if so, uses a Parquet reader optimized for local seekable files. This gave us a massive speedup!
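To make the idea concrete, here is a rough sketch of what a "local" path could look like for JSON Lines. It is not Daft's implementation: the names `is_local`, `chunk_ranges`, `read_jsonl_local`, and `read_jsonl` are hypothetical, and it uses only the Python standard library to show the shape of the approach (check whether the path is local, memory-map the seekable file, split it into byte ranges that end on newlines, and parse the ranges in a thread pool). In Python the GIL limits the actual parallelism of `json.loads`; a Rust version would presumably parse the ranges on real threads.

```python
import json
import mmap
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def is_local(path: str) -> bool:
    """Plain paths and file:// URLs count as local; s3://, gs://, https:// do not."""
    scheme = urlparse(path).scheme
    # A one-letter "scheme" is assumed to be a Windows drive letter (C:\...).
    return scheme in ("", "file") or len(scheme) == 1


def chunk_ranges(buf, n_chunks: int):
    """Split buf into roughly n_chunks (start, end) ranges that each end on a newline."""
    size = len(buf)
    target = max(1, size // n_chunks)
    ranges, start = [], 0
    while start < size:
        end = min(start + target, size)
        if end < size:
            # Extend the range to the next newline so no record is cut in half.
            nl = buf.find(b"\n", end)
            end = size if nl == -1 else nl + 1
        ranges.append((start, end))
        start = end
    return ranges


def parse_range(buf, start: int, end: int):
    """Parse every non-empty line in buf[start:end] as one JSON record."""
    return [json.loads(line) for line in buf[start:end].splitlines() if line.strip()]


def read_jsonl_local(path: str, n_chunks=None):
    """Hypothetical local fast path: mmap the file and parse ranges concurrently."""
    n_chunks = n_chunks or os.cpu_count() or 1
    if os.path.getsize(path) == 0:
        return []  # mmap cannot map an empty file
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
        ranges = chunk_ranges(buf, n_chunks)
        with ThreadPoolExecutor(max_workers=n_chunks) as pool:
            parts = pool.map(lambda r: parse_range(buf, *r), ranges)
            # pool.map yields results in range order, so row order is preserved.
            return [row for part in parts for row in part]


def read_jsonl(path: str):
    """Dispatch: local files take the mmap path, everything else the remote path."""
    if is_local(path):
        if path.startswith("file://"):
            path = path[len("file://"):]  # simplistic; assumes a POSIX-style URL
        return read_jsonl_local(path)
    # Placeholder: remote paths would keep using the existing async reader.
    raise NotImplementedError("remote reads are out of scope for this sketch")
```

Splitting on newlines is safe here because a valid JSON string cannot contain a raw line break, so every newline in the file is a record boundary.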

We would like to do the same for JSON Lines and get performance on par with the pyarrow JSON Lines reader.

Some files to benchmark with:
https://daft-public-data.s3.us-west-2.amazonaws.com/redpajama-1t-sample/stackexchange_sample.jsonl
https://daft-public-data.s3.us-west-2.amazonaws.com/melbourne-airbnb/melbourne_airbnb.csv

samster25 commented 2 weeks ago

@universalmind303 Here's the ticket I had in mind for tomorrow's chat! PTAL and we can go deeper.

samster25 commented 2 weeks ago

cc-ing @clarkzinzow to field questions about the existing Jsonlines reader

universalmind303 commented 2 weeks ago

> we noticed the same issue for our parquet reader and added a "local" path that checks if the file is local and then uses a parquet reader that is optimized for local seekable files

This example provides a lot of context on how this should be implemented. Thanks!