NCAR / cesm-hist2tseries

Tool for converting CESM output from history files into timeseries files, where each variable is separated into its own file
Apache License 2.0

Deciding on Underlying Dependencies/Design #13

Open mgrover1 opened 3 years ago

mgrover1 commented 3 years ago

It seems like there are essentially two options here in terms of the underlying dependencies:

  1. Use NCO (ncrcat and ncks) plus a workflow that submits numerous batch jobs to process history files into timeseries
  2. Use the underlying functionality in Xarray + Dask to convert to suitably sized chunks, making use of the Rechunker tool within the Pangeo stack (see the sketch below)
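
For the second option, here is a rough sketch of what that could look like (hypothetical paths, variable names, and chunk choices; a plain `.chunk()` stands in for an actual Rechunker call):

```python
# Rough sketch of option 2 (hypothetical paths and variable names).
import xarray as xr

# Lazily open a run's history files as Dask-backed arrays.
ds = xr.open_mfdataset(
    "/glade/scratch/user/case/atm/hist/case.cam.h0.*.nc",
    combine="by_coords",
    parallel=True,
)

# Rechunk so time is contiguous, then write one timeseries file per variable.
for name in ["TS", "PRECT"]:  # hypothetical variable list
    da = ds[name].chunk({"time": -1})
    da.to_netcdf(f"/glade/scratch/user/case/tseries/{name}.nc")
```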

Here is a link to the package design document, which is still a work in progress. Comments/feedback are greatly appreciated.

sherimickelson commented 3 years ago

For CMIP6 we used this code https://github.com/ncar/pyreshaper to convert to timeseries, and it might be important to understand why there's a need to develop something new so that you don't run into the same issues. A couple of issues I would like to point out: the code relies on mpi4py, which can be difficult to port, and it is slow converting high-frequency output, so you'll have to be careful about how you parallelize over the problem or a Dask implementation could run into the same issue. The bottleneck is writing out the file in netCDF, and there's not a great option to do this in parallel unless you're writing out in Zarr.

And my 2 cents ... if at all possible, this operation should really be done by PIO coming out of the model. Ignoring legacy runs or development runs that turn into production runs, this operation creates duplicate data when disk space is already at a premium, and it is hard to do without an email to CISL asking for more disk space.

You might also want to talk to @strandwg for his advice on how this should be developed, especially because this task usually falls onto his to-do list.

andrewgettelman commented 3 years ago

I'm very much in favor of the simple philosophy. Maybe we need a simple one and a robust one?

  1. I have a few lines of Python code that just read in data with xarray (open_mfdataset) and then loop over the desired variables (see the sketch after this list). The open seems pretty slow (and could probably be improved), but the write is not too bad. This seems to create timeseries that work fine. I also like it because I can create new variables in the dataset before I do it, or I can extract a level from a 3D field.
  2. I am sure there are better ways to do it, but I would perhaps start by doing something with Dask to speed it up, or to parallelize the variable extraction/writes. That said, I see no problem with what I have done so far.
  3. I can see wanting to rename a variable before writing...
  4. We should probably avoid the NCO route. I think I have a version that calls NCO as well to pull out variables, but that violates the 'simple' part.
  5. I would have to be convinced that anything fancier is needed. I am concerned about Sheri's comment that the pyreshaper code 'relies on mpi4py which can be difficult to port', and that it slows down on high-frequency output.
  6. From what I have been hearing, a Zarr option would be good. I understand this would make future reads much faster? Though maybe more so for multiple-variable files.
  7. Regarding doing timeseries as they come out of the model: I recognize that Sheri has identified the ultimate solution. But that needs a discussion by CSEG and the WG co-chairs. We need this concatenation tool anyway to work on legacy output, but we should recognize it perhaps does not need to be a 'Cadillac' solution.
  8. I am fine if the tool is just a Jupyter notebook with a few commands. Happy if you want to wrap it into a function or whatever, but I think it could be left exposed for people to modify. Or just have different versions: here is the flat notebook, here is a package you can import, etc.
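
To make item 1 concrete, here is roughly the shape of what I mean (a sketch only, with hypothetical variable names and paths, not the actual notebook):

```python
# Sketch of the "simple" approach: open everything lazily, derive/extract what
# you want, then loop over variables and write one timeseries file each.
# Variable names and paths are hypothetical.
import xarray as xr

ds = xr.open_mfdataset("run.cam.h0.*.nc", combine="by_coords")

# Create a new variable before writing (item 1).
ds["PRECT_MM_DAY"] = (ds["PRECC"] + ds["PRECL"]) * 1000.0 * 86400.0

# Extract a single level from a 3D field and give it a new name (items 1 and 3).
ds["T850"] = ds["T"].sel(lev=850.0, method="nearest")

# Loop over the desired variables and write each to its own file.
for name in ["TS", "PRECT_MM_DAY", "T850"]:
    ds[name].to_netcdf(f"tseries.{name}.nc")
```
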
sherimickelson commented 3 years ago

@andrewgettelman and @mgrover1 The performance degradation with high-frequency output is a "feature" of anything you use to write in the netCDF format. For example, take two output streams written by CAM, one containing monthly output and the other 6-hourly data, and suppose both streams have exactly the same total size. The monthly output is going to finish creating timeseries well before the 6-hourly data, despite the monthly data writing more output in the end (because of metadata). This is because the parallelization is always over the number of output files that need to be written, since netCDF file output is a serial operation.

Since the monthly dataset contains more variables, you can spread that work across multiple ranks, with each rank writing its own file. The 6-hourly data contains fewer variables, but because writing a variable out is serial, one rank is responsible for writing many more time slices and no other ranks can help it out. The only way to improve the 6-hourly stream is to create more files for each variable: several smaller files per variable instead of a couple of larger ones. This would give you more parallelism and could make the files easier to work with when it comes to post-processing/diagnostic work.

So I guess my point is to just keep this in mind and try to exploit different ways you can parallelize the problem given the drawbacks and benefits of different I/O libraries.
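
As a sketch of the "more, smaller files per variable" idea (hypothetical stream and variable names): split the time axis so each variable is written as several independent files, which can then be handled concurrently.

```python
# Sketch only: one 6-hourly variable split into one output file per year.
# Each write is still a serial netCDF write, but the files are independent,
# so separate ranks/workers can take them on concurrently.
import xarray as xr

ds = xr.open_mfdataset("case.cam.h2.*.nc", combine="by_coords", parallel=True)
var = ds[["PRECT"]]  # single-variable dataset (hypothetical variable)

# One dataset, and one output file, per year of 6-hourly data.
years, pieces = zip(*var.groupby("time.year"))
paths = [f"tseries.PRECT.{year}.nc" for year in years]

xr.save_mfdataset(pieces, paths)
```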

mgrover1 commented 3 years ago

It seems like one of the bigger questions is how to properly generate "analysis ready data", meaning data in a state that can be easily incorporated into people's workflows... a good way to help solve this would be to use Xarray + Dask + Zarr, storing the data in chunks that are more easily used in those workflows. See this post/video from the UK Met Office; I see many similarities with the climate/weather data they are working with: https://medium.com/informatics-lab/you-do-you-4272f48f4eb2. Obviously this will require additional discussion, and there may be pushback on moving away from writing netCDF, but I think exploring this option is worth the time, as long as we can stress the value of making this move.
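
As a rough sketch of what that could look like (hypothetical paths and chunk sizes, not a settled design):

```python
# Sketch: write a single chunked Zarr store instead of per-variable netCDF
# files, picking chunks that match the expected analysis access pattern.
import xarray as xr

ds = xr.open_mfdataset("case.cam.h0.*.nc", combine="by_coords", parallel=True)

# Long time chunks over modest spatial tiles suit timeseries-style analysis;
# the numbers here are placeholders, not a recommendation.
ds = ds.chunk({"time": 600, "lat": 96, "lon": 144})

# Zarr writes are chunk-parallel, which sidesteps the serial netCDF bottleneck.
ds.to_zarr("case.atm.h0.zarr", consolidated=True)
```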

andrewgettelman commented 3 years ago

I guess the advantage of a simple write in Python for concatenation is that it reads back just as simply. NCO seems to handle that too.

I have heard good things about Zarr files, but perhaps we need to see a demo? And if we are making single-variable timeseries, do we need Zarr files?

Perhaps @mgrover1 you could make a simple demo with a pile of model output that makes timeseries files and Zarr files, and we can take a look? I'm happy to modify my simple notebook to write Zarr files. Or you could. See here:

/glade/u/home/andrew/python/jlab/cesm/extract_variables3_clean.ipynb

That's a super simple concatenator. Can it be made to do Zarr files? I'm sure it can be parallelized with Dask, and then I'm sure you could wrap it into a fancier version (though I like being able to open it up).
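
If it helps, I imagine the change in the notebook could be as small as something like this (hypothetical names, untested):

```python
# Swap the per-variable netCDF write for a Zarr write; everything upstream
# (open_mfdataset, variable selection) stays the same.
import xarray as xr

ds = xr.open_mfdataset("case.cam.h0.*.nc", combine="by_coords")

for name in ["TS", "PRECT"]:  # hypothetical variable list
    # ds[name].to_netcdf(f"tseries.{name}.nc")           # current netCDF write
    ds[[name]].to_zarr(f"tseries.{name}.zarr", mode="w")  # Zarr equivalent
```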