JuliaClimate / ClimateBase.jl

Tools to analyze and manipulate climate (spatiotemporal) data. Also used by ClimateTools and ClimatePlots
https://juliaclimate.github.io/ClimateBase.jl/dev/

[FR] Out-of-memory data reduction. #48

Open Datseris opened 3 years ago

Datseris commented 3 years ago

While the in-memory functionality is great, it is typically the case that you have so much data that it doesn't fit into memory. These data are usually saved in monthly or yearly files, where each file contains e.g. one year of all the data.

This is good for us, because at the moment it isn't hard to wrap your analysis code in a simple for-loop over the files yourself. However, we can streamline many things. For example, the output ClimArray can be pre-initialized and efficiently aggregated over, similarly to how yearlyagg works now; a rough sketch of such a loop follows.
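For reference, a minimal sketch of such a file loop, assuming yearly NetCDF files and a variable named `"t2m"` (both made up for illustration). It uses ClimateBase's `ncread` and `dropagg` plus plain Julia, and computes an unweighted temporal mean rather than anything as polished as `yearlyagg`:

```julia
using ClimateBase

# Minimal sketch: out-of-memory temporal mean over many yearly files.
# File names and the variable name "t2m" are illustrative assumptions.
function outofmemory_timemean(files, var)
    acc = nothing   # running sum over time, one (lon, lat) field
    n = 0           # total number of time points seen so far
    for f in files
        A = ncread(f, var)            # one year of data fits in memory
        s = dropagg(sum, A, Time)     # reduce over the Time dimension
        acc = acc === nothing ? s : acc .+ s
        n += length(dims(A, Time))
    end
    return acc ./ n                   # plain unweighted mean across all files
end

files = ["t2m_$(year).nc" for year in 2000:2010]
time_mean = outofmemory_timemean(files, "t2m")
```

The key point is that only one year's worth of data and the small accumulated field are ever in memory at the same time.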

So in principle there are two ways to do out-of-memory data reduction:

  1. Reduce by aggregating over time, i.e. reducing the total number of time points with an out-of-memory version of yearlyagg that loops over the files.
  2. Reduce by projecting to a lower-resolution grid. This is done for each time slice in each file, once again looping over the files; a rough sketch follows this list. It will require us to have #46 ready so that we can use it here.
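To illustrate the second approach, here is a rough sketch assuming the spatial regridding of #46 eventually exists; `regrid_coarser` is a hypothetical placeholder, not a ClimateBase API:

```julia
using ClimateBase

# Hypothetical placeholder standing in for the regridding of issue #46:
# it would map a ClimArray onto a grid coarsened by `factor`.
function regrid_coarser(A::ClimArray; factor = 4)
    error("placeholder: depends on the spatial regridding of issue #46")
end

files = ["t2m_$(year).nc" for year in 2000:2010]   # illustrative file names

# Each yearly file is loaded, projected onto the coarser grid, and only the
# small coarse result is kept around.
coarse_years = map(files) do f
    A = ncread(f, "t2m")
    regrid_coarser(A; factor = 4)
end
# The coarse yearly arrays are then small enough to concatenate or reduce in memory.
```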

The above is, in my eyes, easy, provided that the prerequisite issues are solved first.

What is also hard is getting automatic parallelization to work here.
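For what it's worth, since the per-file reductions are independent, the file loop itself parallelizes naturally with standard Julia tools. A minimal sketch with `Distributed.pmap` (file names and variable name again illustrative):

```julia
using Distributed
addprocs(4)                          # illustrative worker count
@everywhere using ClimateBase

files = ["t2m_$(year).nc" for year in 2000:2010]   # illustrative file names

# Each worker reduces one file at a time; only the small per-year
# results travel back to the main process.
yearly_means = pmap(files) do f
    A = ncread(f, "t2m")
    timemean(A)                      # ClimateBase's in-memory time average
end
```

The harder part is making this automatic and transparent to the user, rather than something they write by hand.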

Balinus commented 3 years ago

You should look over at ESDL.jl and see how they implemented it. As far as I remember, they do out-of-memory reduction in a parallel manner. It is based on the chunking capabilities of netCDF and Zarr.