mgrover1 opened 3 years ago
For CMIP6 we used this code https://github.com/ncar/pyreshaper to convert to timeseries, and it might be important to understand why there's a need to develop something new so that you don't run into the same issues. A couple of issues I'd point out: the code relies on mpi4py, which can be difficult to port, and it is slow at converting high-frequency output, so you'll have to be careful about how you parallelize the problem; a Dask implementation could run into the same issue. The bottleneck is writing out the file in netCDF, and there's no great option to do this in parallel unless you're writing out in Zarr.

And my 2 cents: if at all possible, this operation should really be done by PIO coming out of the model. Ignoring legacy runs or development runs that turn into production runs, this operation creates duplicate data when disk space is already at a premium, and it is hard to do without an email to CISL asking for more disk space. You might also want to talk to @strandwg for his advice on how this should be developed, especially because this task usually falls onto his to-do list.
I'm very much in favor of the simple philosophy. Maybe we need a simple one and a robust one?
@andrewgettelman and @mgrover1 The performance degradation with high-frequency output is a "feature" of anything you use to write the netCDF format. For example, say you have two output streams written by CAM, one containing monthly output and the other 6-hourly data, and both streams have exactly the same total size. The monthly stream will finish creating timeseries well before the 6-hourly one, despite the monthly stream writing more output in the end (because of metadata). This is because the parallelization is always over the number of output files that need to be written, since netCDF file output is a serial operation. The monthly dataset contains more variables, so you can spread that work across multiple ranks, with each rank writing its own file. The 6-hourly data contains fewer variables, but since writing a given variable out is serial, one rank is responsible for many more time slices and no other rank can help it. The only way to improve the 6-hourly stream is to create more files for each variable, i.e. several smaller files per variable instead of a couple of larger ones. That would give you more parallelism and could make the files easier to work with when it comes to post-processing/diagnostic work.
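To make the point above concrete, here is a minimal stdlib sketch of the task decomposition: each (variable, time-chunk) pair becomes an independent write task, so splitting a high-frequency variable's time axis into several files gives the pool more tasks to run concurrently. All names are illustrative, and the pickle write is just a stand-in for a serial netCDF write.

```python
from concurrent.futures import ThreadPoolExecutor
import os
import pickle
import tempfile

def write_timeseries_file(outdir, varname, chunk_id, time_slices):
    """Stand-in for a serial netCDF write: one file per (variable, chunk)."""
    path = os.path.join(outdir, f"{varname}.{chunk_id:03d}.pkl")
    with open(path, "wb") as f:
        pickle.dump({varname: time_slices}, f)
    return path

def split_into_chunks(times, n_chunks):
    """Split the time axis so one slow variable becomes several write tasks."""
    size = -(-len(times) // n_chunks)  # ceiling division
    return [times[i:i + size] for i in range(0, len(times), size)]

def convert(outdir, variables, times, files_per_var=4, workers=8):
    # One task per (variable, time-chunk): more files per variable -> more parallelism.
    tasks = [(var, i, chunk)
             for var in variables
             for i, chunk in enumerate(split_into_chunks(times, files_per_var))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(write_timeseries_file, outdir, var, i, chunk)
                   for var, i, chunk in tasks]
        return [f.result() for f in futures]

outdir = tempfile.mkdtemp()
# A 6-hourly stream: only two variables, but many time slices.
paths = convert(outdir, ["T", "U"], list(range(1000)), files_per_var=4)
print(len(paths))  # 2 variables x 4 chunks = 8 files
```

With `files_per_var=1` the two variables would occupy only two workers no matter how many time slices they carry, which is exactly the serial bottleneck described above.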
So I guess my point is to just keep this in mind and try to exploit different ways you can parallelize the problem given the drawbacks and benefits of different I/O libraries.
It seems like one of the bigger questions is how to properly generate "analysis ready data", where it is in a state that can be easily incorporated into people's workflows. A good way to help solve this issue would be to use Xarray + Dask + Zarr, storing in chunks that can be more easily utilized in people's workflows. See this post/video from the UK Met Office, where I see many similarities with the climate/weather data they are working with: https://medium.com/informatics-lab/you-do-you-4272f48f4eb2. Obviously this will require additional discussion, and there may be pushback on moving away from writing to netCDF, but I think exploring this option is worth the time, as long as we can stress the value of making this move.
I guess the advantage of a simple write in Python for concatenation is that it is a simple read back. NCO seems to do it too.
I have heard good things about Zarr files, but perhaps we need to see a demo? And if we are making single-variable timeseries, do we need Zarr files?
Perhaps @mgrover1 you could make a simple demo with a pile of model output that makes timeseries files and Zarr files and we can take a look? I'm happy to modify my simple notebook to write Zarr files. Or you could. See here:
/glade/u/home/andrew/python/jlab/cesm/extract_variables3_clean.ipynb
That's a super simple concatenator. Can it be made to do Zarr files? I'm sure it can be parallelized with Dask, and then I'm sure you could wrap it into a more fancy version (though I like being able to open it up).
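For reference, the core of such a concatenator is only a few lines of xarray. This is a minimal sketch with synthetic in-memory "history file" datasets standing in for the real files (real code would use `xr.open_mfdataset` on actual history output, and each per-variable dataset would then go to `to_netcdf` or `to_zarr`):

```python
import numpy as np
import xarray as xr

# Each "history file" holds all variables for a few time steps.
history = [
    xr.Dataset(
        {
            "T": (("time", "lat"), np.random.rand(3, 4)),
            "U": (("time", "lat"), np.random.rand(3, 4)),
        },
        coords={
            "time": np.arange(i * 3, i * 3 + 3),
            "lat": np.linspace(-90, 90, 4),
        },
    )
    for i in range(5)
]

# Concatenate along time, then split into one dataset per variable;
# each of these would be written to its own timeseries (or Zarr) file.
combined = xr.concat(history, dim="time")
timeseries = {name: combined[name].to_dataset() for name in combined.data_vars}

print(sorted(timeseries), timeseries["T"].sizes["time"])
```

Swapping the final write between netCDF and Zarr is then a one-line change, which makes this a convenient place to run the side-by-side demo suggested above.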
It seems like there are essentially two options here in terms of the underlying dependencies:

- NCO (`ncrcat` and `ncks`) and a workflow to submit numerous batch jobs to process from history to timeseries

Here is a link to the package design document, which is still a work in progress. Comments/feedback are greatly appreciated.