nichollsh closed 3 months ago
Running a large-ish grid of models indicates that this is going to be an issue going forward. I have a grid with 1029 points (7*7*7*3) running, which will take about 92 hours to complete. However, the total final size is estimated to be at least 516 GB. This is unreasonably large if we want to analyse (all of) the results.
I am sure there's a way to store the output in a format which is smaller on disk and easier to read.
In the meantime, the best mitigation is to read in only the parts of the data that are required, e.g. only a subset of the cases, or only the final states of all cases.
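As a rough sketch of the "final states only" idea, the snippet below walks a grid folder and keeps just the last row of one CSV file per case. The folder layout (one subfolder per case) and the file name `helpfile.csv` are placeholders chosen for illustration, not the actual PROTEUS layout. Note that each file is still streamed from disk in full; the saving is in memory, because only one row per case is retained.

```python
import csv
from pathlib import Path


def read_final_rows(grid_dir, filename="helpfile.csv", cases=None):
    """Return {case_name: last_row} for the case folders under grid_dir.

    `filename` and the one-folder-per-case layout are hypothetical.
    `cases` optionally restricts the read to a subset of case names.
    """
    final = {}
    for case_dir in sorted(Path(grid_dir).iterdir()):
        if not case_dir.is_dir():
            continue
        if cases is not None and case_dir.name not in cases:
            continue  # skip cases outside the requested subset
        path = case_dir / filename
        if not path.is_file():
            continue  # case has not produced this file (yet)
        last = None
        with path.open(newline="") as f:
            for row in csv.DictReader(f):
                last = row  # overwrite, so only one row is held in memory
        if last is not None:
            final[case_dir.name] = last
    return final
```

Calling `read_final_rows("grid_output", cases={"case_000", "case_001"})` would then load the final state of just those two cases; values come back as strings and need converting before analysis.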
This seems like a good point for discussion with @lsoucasse and @stefsmeets in the next few weeks.
Since this keeps coming up, how about we 'simply' move all outputs to the HDF5 file format? It has both Python and Julia interfaces: https://docs.h5py.org/en/stable/index.html https://juliaio.github.io/HDF5.jl/stable/
I agree that a format like this would be appropriate.
However, I would very much prefer that we use NetCDF instead of HDF5. This is mostly because we are already using NetCDF for storing a fair bit of data anyway, including JANUS and AGNI outputs, opacity information in the DACE pipeline, and in various plotting/analysis tools. NetCDF is also (I think) more popular, so people will be able to access and interpret our data more readily.
There's been a comparison of the two recently. NetCDF is faster at opening/closing files, while HDF5 is faster at reading/writing files. However, the performance differences are small. Technically, the NetCDF-4 format is built on top of HDF5.
Both file formats can be interfaced with Xarray, Pandas, Julia, etc.
Ok, makes sense to me. If @lsoucasse is fine with this, let's aim for unifying all data output to NetCDF then in the future.
Sounds good to me.
Excellent. I think the first step towards this would be to separate input/output variables, as discussed before. Then when we've defined what exactly is an "output" (for now) we can hopefully make an easy move towards using NetCDF.
We can use Xarray to open/close the NetCDF files, potentially, or just interface directly with the netCDF4 library.
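To make the Xarray option concrete, here is a minimal sketch that writes a small time series to a NetCDF file and then reads back only the final time step. The variable names (`T_surf`, `P_surf`), units, and values are invented for illustration and are not the agreed set of PROTEUS outputs. It assumes Xarray is installed together with a NetCDF backend such as netCDF4.

```python
import numpy as np
import xarray as xr


def write_case(path):
    """Write a toy time series to a NetCDF file (placeholder variables)."""
    ds = xr.Dataset(
        data_vars={
            "T_surf": ("time", np.array([3000.0, 2500.0, 1800.0]), {"units": "K"}),
            "P_surf": ("time", np.array([250.0, 260.0, 270.0]), {"units": "bar"}),
        },
        coords={"time": ("time", np.array([0.0, 1.0e3, 1.0e4]), {"units": "yr"})},
    )
    ds.to_netcdf(path)


def read_final_state(path):
    """Open the file lazily and load only the last time step into memory."""
    with xr.open_dataset(path) as ds:
        # open_dataset does not read the arrays up front; .load() pulls in
        # just the selected slice before the file is closed.
        return ds.isel(time=-1).load()
```

After `write_case("case_000.nc")`, calling `read_final_state("case_000.nc")` returns a dataset holding one value per variable, with the `units` attributes preserved.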
Started work on points 1 and 2 of the above in this branch: https://github.com/FormingWorlds/PROTEUS/tree/detangle
@lsoucasse, @timlichtenberg do you feel that this issue is now completed? We have achieved all of the points above except point 4, which is covered by issue #74.
I agree we can close it.
It is important to be able to store, share, and analyse model outputs. At the moment, PROTEUS generates an output folder for each simulation and places most of the files within a data/ subfolder. This is fine, but it does not make the outputs easy to share.

We should consider a friendlier method for sharing the outputs of the model, particularly if it includes compressing the data. Simply compressing the data/ folder into a Zip file for the current earth_demo case reduces its size by a factor of 3.3, which is important to consider when running many models across a grid.

The ideal case would be to synthesise the model output into a single file, including everything from SPIDER, SOCRATES, etc. If done properly, this would be highly portable and much faster than reading many separate files. E.g. using Xarray it is possible to read only part of a file from the disk.
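The Zip comparison can be repeated for any case with the standard library alone. The sketch below archives a folder and returns the ratio of raw size to archive size; the function name `zip_folder` is illustrative, and the ratio obtained will depend on the case and the compression settings.

```python
import zipfile
from pathlib import Path


def zip_folder(folder, archive):
    """Zip `folder` into `archive` and return raw_size / archive_size.

    `archive` should live outside `folder` so it is not swept into itself.
    """
    folder = Path(folder)
    raw_bytes = 0
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                # Store entries relative to the parent, e.g. "data/file.txt"
                zf.write(path, arcname=path.relative_to(folder.parent).as_posix())
                raw_bytes += path.stat().st_size
    return raw_bytes / Path(archive).stat().st_size
```

For example, `zip_folder("output/earth_demo/data", "output/earth_demo/data.zip")` (paths hypothetical) returns the compression factor for that case.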
Potentially connected to Issue #71.