Model output compression and storage

FormingWorlds / PROTEUS

Coupled atmosphere-interior framework to simulate the temporal evolution of rocky planets.

https://fwl-proteus.readthedocs.io

Apache License 2.0


Model output compression and storage #70

Closed nichollsh closed 3 months ago

nichollsh commented 9 months ago

It is important to be able to store, share, and analyse model outputs. At the moment, PROTEUS generates an output folder for each simulation and places most of the files within a data/ subfolder. This is fine, but it does not make the outputs easy to share.

We should consider a friendlier method for sharing the outputs of the model, particularly one that also compresses the data. Simply compressing the data/ folder into a Zip file for the current earth_demo case reduces its size by a factor of 3.3, which matters when running many models across a grid.
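As a rough illustration of the Zip comparison above, the following sketch compresses a folder with Python's standard `zipfile` module and reports the size reduction factor. The helper names (`folder_size`, `zip_folder`) and the dummy file contents are assumptions made for this example and are not part of PROTEUS; the factor it prints for the dummy data is not comparable to the 3.3 measured for earth_demo.

```python
import tempfile
import zipfile
from pathlib import Path


def folder_size(folder: Path) -> int:
    """Total size in bytes of all regular files under `folder`."""
    return sum(p.stat().st_size for p in folder.rglob("*") if p.is_file())


def zip_folder(folder: Path, archive: Path) -> float:
    """Compress `folder` into `archive` and return the size reduction factor.

    `archive` should live outside `folder`, otherwise it would try to
    include itself.
    """
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                # Store paths relative to the folder so the archive is portable
                zf.write(path, arcname=path.relative_to(folder).as_posix())
    return folder_size(folder) / archive.stat().st_size


if __name__ == "__main__":
    # Throwaway stand-in for an output folder (hypothetical file names/contents)
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "data"
        data.mkdir()
        for i in range(5):
            rows = "\n".join(f"{t} {1500.0 + t:.3f}" for t in range(1000))
            (data / f"{i}_atm.txt").write_text(rows)
        factor = zip_folder(data, Path(tmp) / "data.zip")
        print(f"Size reduced by a factor of {factor:.1f}")
```

To try it on a real run, point `zip_folder` at the simulation's data/ folder and write the archive next to it rather than inside it.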

The ideal case would be to synthesise the model output into a single file, including everything from SPIDER, SOCRATES, etc. If done properly, this is highly portable and much faster than reading many separate files. For example, with Xarray it is possible to read only part of a file from disk.

Potentially connected to Issue #71.

nichollsh commented 8 months ago

Running a large-ish grid of models indicates that this is going to be an issue going forward. I have a grid with 1029 points (7*7*7*3) running, which will take about 92 hours to complete. However, the total final size is estimated to be at least 516 GB. This is unreasonably large if we want to analyse (all of) the results.
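The figures above imply roughly 0.5 GB per case. The short calculation below makes that explicit and, purely as an assumption, applies the 3.3 Zip factor measured for earth_demo to the whole grid, which would bring the total down to about 156 GB. Whether that factor carries over to other cases has not been tested in this thread.

```python
# Figures quoted in the comment above
n_cases = 7 * 7 * 7 * 3   # number of grid points
total_gb = 516.0          # estimated total output size [GB]

# Assumption: the Zip factor measured for earth_demo applies to every case
zip_factor = 3.3

per_case_gb = total_gb / n_cases
zipped_total_gb = total_gb / zip_factor

print(f"{n_cases} cases")
print(f"{per_case_gb:.2f} GB per case on average")
print(f"{zipped_total_gb:.0f} GB in total if every case were zipped")
```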

I am sure there's a way to store the output in a format which is smaller on disk and easier to read.

In the meantime, the best way to mitigate this is to read in only the parts of the data that are required, e.g. only a subset of the cases, or only the final states of all cases.
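A minimal sketch of this kind of selective reading is given below. It assumes a hypothetical layout in which each case lives in `case_*/data/` and each snapshot is a JSON file whose name starts with an integer time (e.g. `1000_atm.json`); the real PROTEUS file names and formats may differ, so the glob pattern and parser would need adapting.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator, Optional, Set, Tuple


def final_snapshot(case_dir: Path) -> Optional[Path]:
    """Return the snapshot with the largest time stamp in case_dir/data, if any."""
    snapshots = list((case_dir / "data").glob("*_atm.json"))
    if not snapshots:
        return None
    # Compare times numerically so that 10000 sorts after 500
    return max(snapshots, key=lambda p: int(p.name.split("_")[0]))


def iter_final_states(
    grid_dir: Path, cases: Optional[Set[str]] = None
) -> Iterator[Tuple[str, dict]]:
    """Yield (case name, final state) pairs, optionally for a subset of cases."""
    for case_dir in sorted(grid_dir.glob("case_*")):
        if cases is not None and case_dir.name not in cases:
            continue
        snapshot = final_snapshot(case_dir)
        if snapshot is not None:
            # Only this one file is read; earlier snapshots are never opened
            yield case_dir.name, json.loads(snapshot.read_text())


if __name__ == "__main__":
    # Build a tiny dummy grid to demonstrate the idea
    with tempfile.TemporaryDirectory() as tmp:
        grid = Path(tmp)
        for name in ("case_000", "case_001"):
            data = grid / name / "data"
            data.mkdir(parents=True)
            for t in (0, 500, 10000):
                (data / f"{t}_atm.json").write_text(json.dumps({"time": t}))
        for name, state in iter_final_states(grid, cases={"case_001"}):
            print(name, state)
```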

timlichtenberg commented 8 months ago

This seems like a good point for discussion with @lsoucasse and @stefsmeets in the next few weeks.

timlichtenberg commented 4 months ago

Since this keeps coming up, how about we 'simply' move all outputs everywhere to the HDF5 file format? It has both Python and Julia interfaces: https://docs.h5py.org/en/stable/index.html and https://juliaio.github.io/HDF5.jl/stable/

nichollsh commented 4 months ago

I agree that a format like this would be appropriate.

However, I would very much prefer that we use NetCDF instead of HDF5. This is mostly because we are already using NetCDF for storing a fair bit of data anyway, including JANUS and AGNI outputs, opacity information in the DACE pipeline, and in various plotting/analysis tools. NetCDF is also (I think) more popular, so people will be able to access and interpret our data more readily.

There's been a comparison of the two recently. NetCDF is faster at opening/closing files, while HDF5 is faster at reading/writing files. However, the performance differences are pretty minimal. NetCDF (from version 4) is technically built on top of HDF5.

Both file formats can be interfaced with Xarray, Pandas, Julia, etc.

timlichtenberg commented 4 months ago

Ok, makes sense to me. If @lsoucasse is fine with this, let's aim for unifying all data output to NetCDF then in the future.

lsoucasse commented 4 months ago

Sounds good to me.

nichollsh commented 4 months ago

Excellent. I think the first step towards this would be to separate input/output variables, as discussed before. Then when we've defined what exactly is an "output" (for now) we can hopefully make an easy move towards using NetCDF.

We could potentially use Xarray to read and write the NetCDF files, or just interface directly with the netCDF4 library.

nichollsh commented 4 months ago

1. Disentangle input/output variables. Input variables should not be updated during runtime, and we should avoid passing the entire dictionary to functions.
2. Organise output variables such that we avoid passing the entire dictionary. Avoid redundant variable passing. See also: https://github.com/FormingWorlds/PROTEUS/issues/95
3. Format the shared PROTEUS output data file consistently, based only on model output variables, with pre-defined columns/variable names and a consistent structure. The aim is to enable easy restarts.
4. Reconsider the configuration file format (issue #74).
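One possible way to picture the first three points is sketched below using standard-library dataclasses. This is not the approach taken in the PROTEUS code base; all class, field, and function names (`Inputs`, `Outputs`, `advance`, `write_rows`, and the toy cooling model) are hypothetical and chosen only for illustration.

```python
import csv
import io
from dataclasses import FrozenInstanceError, asdict, dataclass, fields


@dataclass(frozen=True)
class Inputs:
    """Read once from the configuration; frozen so it cannot change at runtime."""
    time_step_yr: float
    cooling_rate_k_per_yr: float


@dataclass
class Outputs:
    """Quantities that the model updates and writes out."""
    time_yr: float
    surface_temp_k: float


# Column names are derived from the output definition, so every file matches
OUTPUT_COLUMNS = [f.name for f in fields(Outputs)]


def advance(state: Outputs, time_step_yr: float, cooling_rate_k_per_yr: float) -> Outputs:
    """Toy update that receives only the values it needs, not a whole dictionary."""
    return Outputs(
        time_yr=state.time_yr + time_step_yr,
        surface_temp_k=state.surface_temp_k - cooling_rate_k_per_yr * time_step_yr,
    )


def write_rows(states, handle) -> None:
    """Write states as CSV rows using the pre-defined column order."""
    writer = csv.DictWriter(handle, fieldnames=OUTPUT_COLUMNS)
    writer.writeheader()
    for state in states:
        writer.writerow(asdict(state))


if __name__ == "__main__":
    inputs = Inputs(time_step_yr=10.0, cooling_rate_k_per_yr=0.5)
    history = [Outputs(time_yr=0.0, surface_temp_k=3000.0)]
    for _ in range(3):
        history.append(
            advance(history[-1], inputs.time_step_yr, inputs.cooling_rate_k_per_yr)
        )
    buffer = io.StringIO()
    write_rows(history, buffer)
    print(buffer.getvalue())
    try:
        inputs.time_step_yr = 20.0  # inputs must not change during the run
    except FrozenInstanceError:
        print("Inputs are read-only at runtime")
```

Because the column list is generated from the `Outputs` definition, a restart routine could read the last row of such a file back into an `Outputs` instance without guessing at the layout.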

nichollsh commented 4 months ago

Started work on points 1 and 2 of the above in this branch: https://github.com/FormingWorlds/PROTEUS/tree/detangle

nichollsh commented 3 months ago

@lsoucasse, @timlichtenberg do you feel that this issue is now completed? We have achieved all of the points above except point 4, which is covered by issue #74.

lsoucasse commented 3 months ago

I agree we can close it.