JackKelly / light-speed-io

Read & decompress many chunks of files at high speed

Utility to extract, reshape, and store a subset of the data, e.g. for extracting timeseries for single PV sites from gridded NWPs #141

Open JackKelly opened 3 months ago

JackKelly commented 3 months ago

If I put on my hat of being an energy forecasting ML researcher, then one of the "dreams" would be to be able to use a single on-disk dataset (e.g. 500 TBytes of NWPs) for multiple ML experiments:

  1. a neural net, which takes in dense imagery from NWPs and satellite imagery, covering the same regions in space and time
  2. an XGBoost model to forecast solar PV power for a handful of specific sites. For each site, the input might be a single "pixel" (one lat/lon location), across time.

If the data is chunked on disk to support use-case 1 (the neural net) then we might use chunks something like y=128, x=128, t=1, c=10. But that sucks for use-case 2 (which only wants a single pixel).
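To make the mismatch concrete, here's a rough back-of-the-envelope sketch of the read amplification use-case 2 would suffer under use-case 1's chunking (chunk sizes are the ones above; the float32 dtype is just an assumption for illustration):

```python
import numpy as np

# Illustrative chunking for use-case 1 (dense spatial crops for the neural net):
chunks = {"y": 128, "x": 128, "t": 1, "c": 10}
dtype = np.dtype("float32")  # assumed dtype, for illustration only

# Reading one pixel's timeseries still forces us to fetch a whole chunk per timestep:
bytes_per_chunk = chunks["y"] * chunks["x"] * chunks["t"] * chunks["c"] * dtype.itemsize
bytes_wanted_per_chunk = 1 * 1 * chunks["t"] * chunks["c"] * dtype.itemsize  # one pixel

print(f"bytes read per chunk:   {bytes_per_chunk:,}")         # 655,360
print(f"bytes wanted per chunk: {bytes_wanted_per_chunk:,}")  # 40
print(f"read amplification:     {bytes_per_chunk // bytes_wanted_per_chunk:,}x")  # 16,384x
```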

So it'd be nice to have a tool to:

  1. extract a subset of the data (e.g. a handful of "pixels" / lat/lon locations),
  2. reshape / rechunk that subset so it suits the new read pattern (e.g. one long timeseries per site), and
  3. store it as a separate on-disk dataset.

Maybe the ideal would be for the user to be able to express these conversions in a few lines of Python, perhaps using xarray, whilst still saturating the IO (e.g. on a cloud instance with a 200 Gbps NIC, reading from and writing to object storage). The user shouldn't have to worry about parallelising stuff.
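Something like the following plain xarray/dask sketch is roughly what the user would want to write (the paths and site coordinates are hypothetical; the dimension names come from the chunking example above; today's tools wouldn't come close to saturating a 200 Gbps NIC doing this, which is the gap such a tool would fill):

```python
import xarray as xr

# Hypothetical paths and site coordinates, for illustration only.
nwp = xr.open_zarr("s3://my-bucket/nwp_dense.zarr")  # chunked y=128, x=128, t=1, c=10

# Extract the pixels nearest to a handful of PV sites (vectorised pointwise selection).
site_ys = [51.5, 52.2, 53.4]
site_xs = [-0.1, 0.1, -2.2]
sites = nwp.sel(
    y=xr.DataArray(site_ys, dims="site"),
    x=xr.DataArray(site_xs, dims="site"),
    method="nearest",
)

# Rechunk so each site's full timeseries lives in as few chunks as possible, then write.
sites = sites.chunk({"site": 1, "t": -1})
sites.to_zarr("s3://my-bucket/nwp_per_site.zarr", mode="w")
```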

Perhaps you'd have multiple on-disk datasets (each optimised for a different read pattern). But the user wouldn't have to manually manage these multiple datasets. Instead, the user would interact with a "multi-dataset" layer which would manage the underlying datasets (see #142).
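As a very rough illustration of what that layer might look like (purely a hypothetical sketch, not anything that exists in light-speed-io; the routing heuristic is deliberately naive and assumes every copy exposes the same dims and coords, differing only in chunking):

```python
import xarray as xr


class MultiDataset:
    """Hypothetical sketch: several on-disk copies of the same logical data,
    each chunked for a different read pattern. The user queries one object,
    and it routes each read to whichever copy touches the fewest chunks."""

    def __init__(self, stores: dict[str, str]):
        # e.g. {"dense":    "s3://bucket/nwp_dense.zarr",
        #       "per_site": "s3://bucket/nwp_per_site.zarr"}
        self.datasets = {name: xr.open_zarr(path) for name, path in stores.items()}

    def sel(self, **indexers):
        def n_chunks_touched(ds: xr.Dataset) -> int:
            # Approximate cost: number of dask chunks remaining after the selection.
            sub = ds.sel(**indexers)
            total = 0
            for var in sub.data_vars.values():
                if var.chunks is not None:  # dask-backed
                    n = 1
                    for dim_chunks in var.chunks:
                        n *= len(dim_chunks)
                    total += n
            return total

        best = min(self.datasets.values(), key=n_chunks_touched)
        return best.sel(**indexers)
```

In practice the routing (and keeping the copies in sync) would need to be much smarter, but this is the rough shape of the user-facing API: one logical dataset, many physical layouts.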