Open samster25 opened 2 weeks ago
@universalmind303 Here's the ticket I had in mind for tomorrow's chat! PTAL and we can go deeper.
cc-ing @clarkzinzow to field questions about the existing Jsonlines reader
> we noticed the same issue for our parquet reader and added a "local" path that checks if the file is local and then uses a parquet reader that is optimized for local seekable files
This example provides a lot of context on how this should be implemented. Thanks!
We currently have an optimized multithreaded JSON Lines reader for reading from cloud storage that is based on Rust async/await and Tokio. This makes it one of the fastest ways to read JSON from S3, as seen here.
This function currently shares the same code path for S3/GCS/Azure and for local files, which makes local reads much slower than they need to be. We noticed the same issue with our Parquet reader and added a "local" path that checks whether the file is local and, if so, uses a Parquet reader optimized for local seekable files. This gave us a massive speedup!
We would like to do the same for JSON Lines and end up with something on par with the pyarrow reader for JSON Lines.
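To make the "local path" idea concrete, here is a minimal sketch of the dispatch, using only the Rust standard library. It is not Daft's actual code: `ReadPath`, `choose_read_path`, and `read_jsonl_lines_local` are hypothetical names invented for illustration, and the scheme check (anything containing `://` other than `file://` counts as remote) is an assumption. A real local reader would also parse each record and likely read chunks in parallel. This only shows the routing and a blocking, buffered read that never touches the async runtime.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::PathBuf;

/// Hypothetical routing decision: which reader a source URI should use.
#[derive(Debug, PartialEq)]
enum ReadPath {
    /// File on the local filesystem (seekable, no network latency).
    Local(PathBuf),
    /// Object store URI such as s3://, gs://, or az://.
    Remote(String),
}

/// Assumption: `file://` URIs and scheme-less paths are local;
/// any other `scheme://` URI goes to the existing async cloud reader.
fn choose_read_path(uri: &str) -> ReadPath {
    if let Some(rest) = uri.strip_prefix("file://") {
        ReadPath::Local(PathBuf::from(rest))
    } else if uri.contains("://") {
        ReadPath::Remote(uri.to_string())
    } else {
        ReadPath::Local(PathBuf::from(uri))
    }
}

/// Local path: plain blocking buffered reads, one record per line.
/// Blank lines are skipped; parsing the JSON itself is left out of this sketch.
fn read_jsonl_lines_local(path: &std::path::Path) -> io::Result<Vec<String>> {
    let reader = BufReader::new(File::open(path)?);
    let mut records = Vec::new();
    for line in reader.lines() {
        let line = line?;
        if !line.trim().is_empty() {
            records.push(line);
        }
    }
    Ok(records)
}
```

A caller would match on `choose_read_path(uri)` and hand `Remote` URIs to the existing Tokio-based reader, mirroring what the Parquet reader's local path does.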
Some files to benchmark with:
https://daft-public-data.s3.us-west-2.amazonaws.com/redpajama-1t-sample/stackexchange_sample.jsonl
https://daft-public-data.s3.us-west-2.amazonaws.com/melbourne-airbnb/melbourne_airbnb.csv