Optimize Herbie.download/subset with MultiThreading?

blaylockbk / Herbie

Download numerical weather prediction datasets (HRRR, RAP, GFS, IFS, etc.) from NOMADS, NODD partners (Amazon, Google, Microsoft), ECMWF open data, and the University of Utah Pando Archive System.

https://herbie.readthedocs.io/

MIT License


Optimize Herbie.download/subset with MultiThreading? #197

Open blaylockbk opened 1 year ago

blaylockbk commented 1 year ago

If a user requests multiple GRIB messages, the subset download function downloads each (non-adjacent) GRIB message with a separate cURL call and appends it to the same file. This can be slow when the requested messages are scattered throughout the full file, because every one of them needs its own cURL download. Is it possible to optimize this by using multithreading?

This would require downloading each message with cURL to its own temp file, then concatenating all the temp files into one GRIB file.
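A minimal sketch of that idea, not Herbie's actual implementation: each byte range is fetched in its own thread into a temp file, and the parts are then concatenated in the requested order. The names `fetch_range` and `download_subset` are hypothetical, the byte ranges are assumed to be already known (for example, parsed from the index file), and the standard-library `urllib` stands in for cURL here.

```python
import shutil
import tempfile
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def fetch_range(url, start, end, dest):
    """Download bytes start..end (inclusive) of url into dest (hypothetical helper)."""
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req, timeout=60) as resp:
        # 206 Partial Content means the server honored the Range header.
        if resp.status != 206:
            raise RuntimeError(f"server did not return a partial response for {url}")
        with open(dest, "wb") as f:
            shutil.copyfileobj(resp, f)
    return dest


def download_subset(url, byte_ranges, outfile, max_workers=4, fetch=fetch_range):
    """Fetch each (start, end) range in its own thread, then join the parts in order."""
    with tempfile.TemporaryDirectory() as tmp:
        parts = [Path(tmp) / f"part_{i:04d}.grib2" for i in range(len(byte_ranges))]
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = [
                pool.submit(fetch, url, start, end, part)
                for (start, end), part in zip(byte_ranges, parts)
            ]
            for future in futures:
                future.result()  # re-raise any download error here
        # Concatenate in the order the ranges were requested, not completion order.
        with open(outfile, "wb") as out:
            for part in parts:
                out.write(part.read_bytes())
    return outfile
```

Writing each part to its own temp file avoids several threads appending to the same output file at once; the final concatenation is sequential and cheap compared with the downloads.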

In some cases, it may be faster to download the full file, then subset it by the GRIB messages you want (but that would require wgrib2, which isn't available on Windows).
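A rough sketch of the download-then-subset alternative, assuming the full file is already on disk and `wgrib2` is on the PATH. The function names and the `remove_full` option are hypothetical; the sketch relies only on wgrib2's `-match` (regular-expression filter on the inventory) and `-grib` (write matching messages to a new file) options.

```python
import os
import shutil
import subprocess


def build_wgrib2_command(full_file, pattern, subset_file):
    """Build the wgrib2 call that copies messages matching pattern to subset_file."""
    return ["wgrib2", str(full_file), "-match", pattern, "-grib", str(subset_file)]


def subset_with_wgrib2(full_file, pattern, subset_file, remove_full=False):
    """Subset a downloaded GRIB2 file with wgrib2 (hypothetical helper)."""
    if shutil.which("wgrib2") is None:
        # Expected on systems without wgrib2, so callers can fall back to
        # per-message range downloads instead.
        raise RuntimeError("wgrib2 was not found on PATH")
    subprocess.run(build_wgrib2_command(full_file, pattern, subset_file), check=True)
    if remove_full:
        os.remove(full_file)  # optionally discard the full file after subsetting
    return subset_file
```

For example, `subset_with_wgrib2("full.grib2", ":TMP:2 m above ground:", "subset.grib2")` would keep only the 2 m temperature messages, where the file names and search pattern are placeholders.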

karlwx commented 1 year ago

> In some cases, it may be faster to download the full file, then subset it by the GRIB messages you want (but that would require wgrib2, which isn't available on Windows).

I definitely support this idea. This would also make sense for sources that don't have index files for whatever reason -- download the whole thing, subset with wgrib2, then remove the original file. It would need to be an option because of the Windows issue, but it would be a useful option nonetheless!

For example, the Canadian model data doesn't have index files, although each parameter is in its own file... https://dd.weather.gc.ca/model_gem_global/

blaylockbk commented 11 months ago

Note to self: an example of how to implement this: https://realpython.com/python-download-file-from-url/#using-a-pool-of-threads-with-the-requests-library