CliMA / LESbrary.jl

📚Generating Oceananigans large eddy simulation (LES) data for calibrating parameterizations
MIT License

Comprehensive LESbrary plots/movies #56

Closed ali-ramadhan closed 3 years ago

ali-ramadhan commented 3 years ago

This PR uses NetCDF output and GeoData.jl to make comprehensive plots/movies so it's easy to see what all the fields and statistics look like.

But I'm opening it as a draft since it would be good to be able to select between JLD2 and NetCDF, and it would be good to generalize the plotting/animating to work with other examples/scripts.

ali-ramadhan commented 3 years ago

Yeah, I'm not advocating for a total switch to NetCDF, which is why I'm hoping to add a command line argument

--output {netcdf,jld2}

to allow LESbrary.jl users to switch between the two.
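
A minimal sketch of how such a flag could behave, written in Python with `argparse` purely for illustration (the LESbrary.jl scripts themselves are Julia). The `netcdf` default, the file extensions, and the names `build_parser` and `output_filename` are assumptions, not part of this PR.

```python
import argparse

def build_parser():
    """Build a parser exposing the proposed --output {netcdf,jld2} flag."""
    parser = argparse.ArgumentParser(description="Hypothetical LESbrary-style run script")
    parser.add_argument(
        "--output",
        choices=["netcdf", "jld2"],  # any other value is rejected by argparse
        default="netcdf",            # assumed default, not decided in this thread
        help="file format used for simulation output",
    )
    return parser

def output_filename(prefix, output_format):
    """Map the chosen format to a file name (extensions are assumed)."""
    extensions = {"netcdf": ".nc", "jld2": ".jld2"}
    return prefix + extensions[output_format]

# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--output", "jld2"])
print(output_filename("free_convection", args.output))  # prints "free_convection.jld2"
```

Restricting the flag with `choices` means a typo such as `--output netcfd` fails at startup rather than after the simulation has run.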

I think it's important to support NetCDF output, and especially when making the LESbrary.jl data public it should be in NetCDF. It is the de facto standard for distributing model output in oceanography (and atmospheric + climate science), with a huge number of mature tools built around it.

> Mostly just worried about being hamstrung in the future by NetCDF's inflexibility.

I agree that NetCDF is not as flexible because JLD2 can serialize almost anything to disk while NetCDF is hamstrung by the NetCDF C library. However, I have yet to encounter any true limitations to using NetCDF for Oceananigans output. Have you?

If I encounter some data that cannot be saved with NetCDF, I would use JLD2 for that. But I have not come across this situation yet.

Personal opinion: NetCDF was made to store and distribute exactly the kind of model output we are generating. I think being able to use the large number of tools/packages for working with NetCDF (e.g. GeoData.jl, xarray, Panoply) outweighs the limitation that NetCDF cannot serialize everything to disk.

Personal opinion continued: I also can't bring myself to use JLD2 because plotting JLD2 output requires more lines and data wrangling than using GeoData.jl or xarray with NetCDF. And I'm still somewhat reliant on Python + xarray + joblib for fast and easy parallel plotting. Despite multiple attempts at parallel plotting in Julia (I've tried using Distributed, Dagger.jl, and FLoops.jl) the xarray + joblib Python solution is still way better.
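
To make the parallel plotting pattern concrete, here is a minimal sketch that maps a per-frame plotting function over frame indices. It uses the standard library's `concurrent.futures` as a stand-in for joblib, and `render_frame` is a hypothetical placeholder that only returns a file name instead of loading data with xarray and saving a figure.

```python
from concurrent.futures import ThreadPoolExecutor

def render_frame(index):
    """Placeholder for plotting one time index; returns the frame's file name."""
    # A real version would load time index `index` from the output file and save a figure here.
    return f"frame_{index:05d}.png"

def render_all(n_frames, executor):
    """Render every frame on the given executor, returning file names in frame order."""
    # Executor.map yields results in the order of its inputs, even if frames finish out of order.
    return list(executor.map(render_frame, range(n_frames)))

with ThreadPoolExecutor(max_workers=4) as pool:
    filenames = render_all(8, pool)

print(filenames[0], filenames[-1])  # prints "frame_00000.png frame_00007.png"
```

Threads are used here only to keep the sketch self-contained. Plotting is CPU-bound, so a real script would more likely pass a `ProcessPoolExecutor` (or use joblib) so that frames render on separate cores.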

> Why does GeoData only work with NetCDF?

I guess the author of GeoData.jl found it worthwhile to add support for NetCDF, presumably because they believe it's the most useful feature for potential users. After all, most of the model output data out there is in NetCDF.

Nothing is stopping someone from adding support for JLD2, although some choices would probably have to be made about how the geophysical data is stored in JLD2. NetCDF has a standardized way of storing the data, dimensions, and variables, so GeoData.jl should support reading virtually any NetCDF file out there.

Note that GeoData.jl also supports reading GRD, GDAL, and SMAP HDF5 files: https://rafaqz.github.io/GeoData.jl/dev/#GeoData

glwagner commented 3 years ago

> However, I have yet to encounter any true limitations to using NetCDF for Oceananigans output. Have you?

I have encountered issues with both NetCDF and JLD2 output regarding my ability to manipulate data in post-processing. Things like differentiating, integrating, averaging, and working with fields at different locations are not convenient and require care with indices. I think solutions for both output formats exist, but the solution is easier for JLD2. I have found it convenient to serialize types on occasion. These are minor convenience issues so far. But I'm worried that our applications are only the tip of the iceberg of potential Oceananigans applications. My applications are bare bones; mostly I am just plotting raw output. It's hard to anticipate what people might want to do. User-defined Lagrangian particles are one new thing that's a lot easier to serialize via JLD2.

I wasn't thinking of generic support for JLD2 with GeoData, just Oceananigans data. Presumably we know or can know all the info we need to give to GeoData to use its plotting utilities?

Hopefully we can figure out parallel plotting in Julia; I am a lot more productive in Julia (since I have invested so much time in it, whereas I have not invested the same in Python) so I usually prefer Julia solutions.

ali-ramadhan commented 3 years ago

Are there two issues here, maybe?

  1. We are developing LESbrary.jl together. Should we decide on a common file format to use? Having one file format would probably make it easier to work together.

  2. We are committing to make the LESbrary.jl simulation output publicly available. Which file format do we use to distribute the data?

adelinehillier commented 3 years ago

One idea would be to output both for the simulations whose outputs we're going to make publicly available, though that would of course be memory intensive. I think the command line argument is a good idea!

ali-ramadhan commented 3 years ago

The JLD2OutputWriter and NetCDFOutputWriter interfaces are quite similar and pretty close to being unified.

If JLD2 Oceananigans output could be accessed nicely via GeoData.jl then the same setup/analysis/visualization code would work with both JLD2 and NetCDF.

glwagner commented 3 years ago

There are definitely some challenges to address here. With JLD2 we would serialize things like boundary conditions, coriolis, buoyancy model, etc. With NetCDF, we probably want to come up with a way to automatically add parameters associated with these to metadata. This might be useful to add to NetCDFOutputWriter within Oceananigans.
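
One way this could look, as a rough sketch: flatten a nested description of the model parameters into flat key/value pairs that fit NetCDF global attributes. The function name `flatten_parameters`, the underscore naming scheme, and the example parameter values are all hypothetical; this is not how NetCDFOutputWriter currently works.

```python
def flatten_parameters(params, prefix=""):
    """Flatten nested parameter dicts into flat attribute names and simple values."""
    flat = {}
    for key, value in params.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested groups, e.g. coriolis -> f becomes "coriolis_f".
            flat.update(flatten_parameters(value, prefix=name + "_"))
        elif isinstance(value, bool):
            # NetCDF has no boolean type, so store flags as 0/1 (checked before int).
            flat[name] = int(value)
        elif isinstance(value, (int, float, str)):
            flat[name] = value
        else:
            # Fall back to a text representation for anything else.
            flat[name] = repr(value)
    return flat

# Illustrative values only, not taken from any LESbrary.jl simulation.
example = {
    "coriolis": {"f": 1e-4},
    "buoyancy": {"model": "SeawaterBuoyancy", "constant_salinity": True},
}
print(flatten_parameters(example))
# prints {'coriolis_f': 0.0001, 'buoyancy_model': 'SeawaterBuoyancy', 'buoyancy_constant_salinity': 1}
```

The resulting dictionary could then be handed to whatever NetCDF library is in use as global attributes, which keeps the parameters readable by tools like xarray or Panoply without needing Julia types.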

ali-ramadhan commented 3 years ago

@glwagner Just replying to your comments about saving simulation parameters as metadata here:

I would definitely be on board with a more coherent approach to saving simulation parameters. In this PR I basically saved every parameter I could think of since I didn't know what I would need, but this in itself could be misleading as you've pointed out. I agree that we should save every important parameter required for reproducing the simulation and no more.

As for saving metadata to disk, I agree serializing the struct to disk is an elegant approach. For NetCDF we could take an approach like we did with add_schedule_metadata!.

In my workflows I've been using the parameters from Oceananigans output to initialize other models, e.g. KPP or a neural differential equation to compare with. But perhaps if you're working with multiple models then these parameters should be specified before any model is run to avoid having to e.g. specify a meaningless reference density in Oceananigans.

glwagner commented 3 years ago

Discussed on zoom:

ali-ramadhan commented 3 years ago

@glwagner I've implemented the four changes the three of us agreed on.

Plotting (done if the --animation flag is passed) actually takes quite a while since it's all done serially on one core. I think when running an 8-day simulation the plotting actually takes longer than running the simulation lol.

That said, I kinda wish I had made the plots with Makie.jl/CairoMakie instead of Plots.jl. Makie.jl should be faster. Switching to Makie.jl will also mean that we won't need to depend on GeoData.jl (also waiting for https://github.com/rafaqz/GeoData.jl/pull/129 to get merged).

@glwagner I'll leave it up to you to decide whether the switch to Makie should be done as part of this PR or if it could wait for a future PR.