If I put on my hat of being an energy forecasting ML researcher, then one of the "dreams" would be to be able to use a single on-disk dataset (e.g. 500 TBytes of NWPs) for multiple ML experiments:
1. a neural net, which takes in dense imagery from NWPs and satellite imagery, covering the same regions in space and time
2. an XGBoost model to forecast solar PV power for a handful of specific sites. For each site, the input might be a single "pixel" (a single lat/lon location), across time.
If the data is chunked on disk to support use-case 1 (the neural net) then we might use chunks something like `y=128, x=128, t=1, c=10`. But that sucks for use-case 2 (which only wants a single pixel).
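To make "sucks" concrete, here's a back-of-envelope calculation of the read amplification when pulling a single-pixel timeseries out of image-optimised chunks (assuming float32 values; the numbers are illustrative):

```python
# Read amplification for one pixel's timeseries under chunks y=128, x=128, t=1, c=10.
BYTES = 4                      # float32
Y, X, T, C = 128, 128, 1, 10   # chunk shape for use-case 1

chunk_bytes = Y * X * T * C * BYTES

# Reading one pixel for n timesteps means fetching one whole chunk per timestep:
n_timesteps = 4096
bytes_read = n_timesteps * chunk_bytes
bytes_needed = n_timesteps * C * BYTES  # the pixel's channels, every timestep

amplification = bytes_read / bytes_needed
print(f"read amplification: {amplification:,.0f}x")  # → 16,384x (i.e. 128 * 128)
```

So every useful byte drags along the other 16,383 bytes of its spatial chunk.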
So it'd be nice to have a tool to:
- easily extract long timeseries for a handful of sparse locations, and maybe save those in chunk sizes of something like `y=1, x=1, t=4096, c=10`
- append to these timeseries
- automatically append to the timeseries datasets when new timesteps are added to the dense dataset?
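The extraction step might look something like the following sketch, using xarray's vectorised nearest-neighbour selection. The dataset here is a tiny synthetic stand-in; in practice `dense` would be opened from object storage (e.g. `xr.open_zarr("s3://bucket/dense_nwp.zarr")`), and the store paths, variable names, and site coordinates are all made up:

```python
import numpy as np
import xarray as xr

# Tiny synthetic stand-in for the dense, image-optimised dataset.
dense = xr.Dataset(
    {"temperature": (("time", "y", "x"), np.random.rand(8, 64, 64))},
    coords={
        "time": np.arange(8),
        "y": np.linspace(50, 55, 64),
        "x": np.linspace(-3, 1, 64),
    },
)

# A handful of sparse sites (coordinates are illustrative).
site_y = xr.DataArray([51.5, 53.5], dims="site")
site_x = xr.DataArray([-0.1, 0.9], dims="site")

# One pixel per site, all timesteps, in a single vectorised select.
sparse = dense.sel(y=site_y, x=site_x, method="nearest")
print(sparse["temperature"].shape)  # (8, 2): all timesteps for 2 sites

# In practice this would then be written with timeseries-friendly chunks:
#   sparse.chunk({"site": 1, "time": 4096}).to_zarr("s3://bucket/sparse.zarr")
# and new timesteps appended later with:
#   new_steps.to_zarr("s3://bucket/sparse.zarr", append_dim="time")
```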
Maybe the ideal would be for the user to be able to express these conversions in a few lines of Python, perhaps using xarray, whilst still saturating the IO (e.g. a cloud instance with a 200 Gbps NIC, reading from and writing to object storage). The user shouldn't have to worry about parallelising stuff.
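Plain dask already gets part of the way to the "don't worry about parallelising" goal: a rechunk is expressed declaratively and dask schedules the parallel reads/writes itself (though it doesn't tune for a 200 Gbps NIC, and chunk-to-chunk rechunks at the 500 TB scale usually need something like the `rechunker` library to bound memory). A minimal sketch, with tiny stand-in shapes:

```python
import dask.array as da

# Stand-in for the dense store: t=4096 timesteps, chunked t=1, y=128, x=128.
dense = da.random.random((4096, 128, 128), chunks=(1, 128, 128))

pixel = dense[:, 64, 64]        # one pixel, all timesteps (lazy, nothing read yet)
pixel = pixel.rechunk((4096,))  # timeseries-friendly chunking
print(pixel.chunks)             # ((4096,),)
```

Computing or storing `pixel` then fans the work out across dask's workers with no explicit parallel code from the user.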
Perhaps you'd have multiple on-disk datasets (each optimised for a different read pattern). But the user wouldn't have to manually manage these multiple datasets. Instead the user would interact with a "multi-dataset" layer which would manage the underlying datasets (see #142).