Cached byte_range values for parallelization and time-series applications.

joshuaeh commented 1 year ago

Inspired by #40 and #213, I've forked the repo and am trying to improve performance by caching data from the index file for repeated use on GRIB files, so that I don't have to create a DataFrame and run a regex search for each variable I'm interested in. Eventually I'd like to extend FastHerbie to better support time-series applications.

I understand from previous discussions and Stack Overflow that AWS won't allow multiple byte ranges in a single request. My rationale is that, even without multiprocessing, finding the byte ranges once rather than on every iteration should improve performance.
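As a rough illustration of the "look it up once" idea, here is a minimal sketch that keeps the lines of each index file in memory, keyed by the index file's URL, so repeated variable lookups against the same file do not re-read it. `IndexCache`, `load_index`, and `matching` are hypothetical names made up for this example; they are not part of Herbie or of the fork linked in this issue.

```python
from typing import Callable


class IndexCache:
    """Hypothetical helper: keep one set of index lines per index-file URL."""

    def __init__(self, load_index: Callable[[str], str]):
        # load_index is any callable that returns the index file's text,
        # e.g. a function that downloads it or reads it from disk.
        self._load_index = load_index
        self._cache: dict[str, list[str]] = {}

    def lines(self, index_url: str) -> list[str]:
        # Only call the loader the first time a given index file is requested.
        if index_url not in self._cache:
            self._cache[index_url] = self._load_index(index_url).splitlines()
        return self._cache[index_url]

    def matching(self, index_url: str, variable: str) -> list[str]:
        # Return the index lines whose fields include the given variable name.
        return [line for line in self.lines(index_url) if f":{variable}:" in line]
```

The cache is keyed per index file, so it only saves work when several variables are pulled from the same GRIB file; it makes no assumption that byte ranges carry over from one file to another.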

Hopefully it works. It's still missing some of what's needed for multiple-byte-range requests, but I've started an implementation using requests. The implementation so far is here: https://github.com/joshuaeh/Herbie

Discussed in https://github.com/blaylockbk/Herbie/discussions/40

Originally posted by **adair-kovac** February 1, 2022

Hi @blaylockbk, I meant to at least write some benchmarks to verify and quantify this, but it's been 2 months and I haven't done that, so I'll just report an impression I got about the performance. The byte range selection for the HRRR *should* be significantly faster than downloading the whole GRIB2 file, and I believe it is if you use the boto3 library. But from a place with decent network speed (so not my home wifi, but the CHPC or an AWS EC2 node), it's actually faster to download the whole GRIB2 file than to select a certain field using herbie. I'm guessing that's due to curl overhead, though my second guess is that it could be due to whatever the process for indexing into the grib file is. I see from the code comments that you've thought about different ways of implementing the byte range selection, and I think it would be a good enhancement to herbie if that were reliably faster than downloading the whole file.

joshuaeh commented 1 year ago

My hope that each variable would occupy the same byte ranges in every grib file did not pan out. Other options to consider:

- Changing the source order if selecting more than one byte range
- Using joblib to make each request separately, then combining the data afterward

blaylockbk commented 1 year ago

Thanks for giving this some thought. I have thought about using multithreading to download each requested byte range (to speed up the download) and then cat them together after all portions are returned. Last time I played around with that idea, my routine was a bit unstable (sometimes a thread would hang and stall the whole process). Probably worth revisiting.
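For reference, a minimal sketch of that idea using only the standard library: each byte range is fetched in its own thread and the parts are joined in the order they were requested. `fetch_range` and `download_ranges` are hypothetical names for this example, not Herbie functions, and the per-request `timeout` is an assumption meant to make a stalled connection raise an error instead of hanging indefinitely.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional
from urllib.request import Request, urlopen


def fetch_range(url: str, byte_range: tuple[int, Optional[int]],
                timeout: float = 30.0) -> bytes:
    """Fetch one byte range with an HTTP Range header (end=None means 'to end of file')."""
    start, end = byte_range
    range_value = f"bytes={start}-" if end is None else f"bytes={start}-{end}"
    request = Request(url, headers={"Range": range_value})
    # The timeout applies to blocking socket operations on this request.
    with urlopen(request, timeout=timeout) as response:
        return response.read()


def download_ranges(url: str,
                    byte_ranges: list[tuple[int, Optional[int]]],
                    fetch: Callable[[str, tuple[int, Optional[int]]], bytes] = fetch_range,
                    max_workers: int = 4) -> bytes:
    """Fetch each range in its own thread and join the parts in request order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which thread finishes first.
        parts = pool.map(lambda byte_range: fetch(url, byte_range), byte_ranges)
        return b"".join(parts)
```

Passing the fetch function in as an argument keeps the threading logic testable without network access, and it leaves room to swap in requests or boto3 for the actual transfer.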

blaylockbk commented 1 year ago

And yes, the fact that byte ranges are unique to each file does require Herbie to read the file-specific index file to know how to get the byte ranges.
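To make that concrete, here is a minimal sketch of turning a wgrib2-style index (inventory) file into byte ranges. It assumes colon-separated lines of the form `message:start_byte:d=date:variable:level:forecast:`, where a message ends one byte before the next message starts. The function names and the sample values are made up for illustration; this is not Herbie's actual implementation.

```python
import re
from typing import Optional

# Illustrative index text; the byte offsets are invented for this example.
SAMPLE_INDEX = """\
1:0:d=2022020100:REFC:entire atmosphere:anl:
2:375319:d=2022020100:RETOP:cloud top:anl:
3:517312:d=2022020100:TMP:2 m above ground:anl:
"""


def parse_index(index_text: str) -> list[dict]:
    """Parse index lines into dicts with start and end bytes."""
    rows = []
    for line in index_text.strip().splitlines():
        fields = line.split(":")
        rows.append({
            "message": int(fields[0]),
            "start_byte": int(fields[1]),
            "variable": fields[3],
            "level": fields[4],
        })
    # Each message ends one byte before the next message begins.
    for current, following in zip(rows, rows[1:]):
        current["end_byte"] = following["start_byte"] - 1
    # The last message has no successor, so its range is open-ended.
    if rows:
        rows[-1]["end_byte"] = None
    return rows


def find_byte_ranges(rows: list[dict], pattern: str) -> list[tuple[int, Optional[int]]]:
    """Return (start, end) for every message whose ':variable:level' matches the regex."""
    regex = re.compile(pattern)
    return [
        (row["start_byte"], row["end_byte"])
        for row in rows
        if regex.search(f":{row['variable']}:{row['level']}")
    ]
```

Because the offsets come from the positions of the messages inside one particular file, the parsed rows are only valid for the GRIB file that the index describes.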

blaylockbk / Herbie

Cached byte_range values for parallelization and time-series applications. #219
