File format for describing complete set of initial conditions

bangerth commented 4 years ago

On the FRES call last Friday, we spent a significant amount of time thinking through how we want to describe the initial conditions. The issues we have are:

Whether what ASPECT sees are the initial conditions for temperature plus other variables such as chemical composition and grain size (i.e., essentially every variable inside ASPECT that has a time derivative on it), or whether ASPECT gets to see what these quantities are computed from -- most notably Vp, Vs, and maybe density.
How exactly to present this data. We currently have the ASCIIData plugins, but I'm not a big fan of this format because it is not self-describing: The interpretation of what is in the file is implied rather than explicit, and can not be queried. There is also the issue that the initial condition models we want to describe may be hierarchical and multilevel: We may have a global background model with one or more "patches" (possibly overlapping) where we have higher resolution information (e.g., Western US, Cascadia). In some sense, the global background model is also a patch: It just encompasses the entire Earth. (This approach of "patches" is inspired by @MFraters's WorldBuilder -- just represented in an XML or HDF5 format, see below.)

My idea to represent this kind of information would be in a file format that stores the following:

It's a container of patches
Each patch stores: (i) it's geometric extent, in the simplest case just a box in lat/long/depth, (ii) a collection of depth slices, (iii) the names of the fields that are stored in each depth slice, (iv) a priority so that when we look for the initial condition at a point, highest priority patches are searched first -- typically, geographically small will mean highest priority.
Each depth slice of a patch is a container that stores (i) its depth, (ii) the lat/long spacing of points, (iii) a collection of field values defined at these points.

It should not be very difficult to represent all of these two things in an XML or HDF5 file. Both have the advantage that they can be queried and interrogated. We already have HDF5 functionality in deal.II that would make it pretty easy to read a table of values, which could then be handed off to something like InterpolatedTensorProductGridData or InterpolatedUniformGridData.

Both XML and HDF5 have the substantial advantage that there are bindings to many other languages (e.g., R or Python) that would allow us to write pre-processing scripts in other languages.

Questions:

Are there pre-existing file formats elsewhere that we could piggy-back on?
Most data we will use as inputs is seismic in nature (Vs, Vp, density) whereas we need temperature and compositions. Where should the conversion happen? Inside ASPECT, or in a pre-processing step? My preference would be to move this outside ASPECT to keep some sort of separation of concerns and avoid having to introduce yet another layer of plugins.
Is data always available for box-shaped patches, or do we need to allow for non-rectangular regions? We could press these into boxes by allowing "mask values" -- if the temperature is "NaN", just discard it and search again in the next lower priority patch. But there may be better ways of representing data.
Is data always available on a tensor product mesh in lat/long/depth? Or do we want to allow for completely unstructured meshes for each patch? That would make searching for the value at a particular point substantially harder, of course.
How do we represent uncertainties? In a self-describing file format like XML or HDF5, this is likely not very difficult: We just add more fields to the file, where the reader can query whether they exist and what their properties (HDF5 "attributes") are. But it's probably worth giving this some thought.

Ideas? Comments?

tjhei commented 4 years ago

A few thoughts: Creating an XML format sounds like a bad idea, as it requires some kind of hand-written vector/table structure inside an XML node. This could be text or base64 encoded (like vtu), but would require separate implementations in c++, python, etc.. I think building on hdf5 makes more sense (and makes it possible to write in parallel).

We may have a global background model with one or more "patches" (possibly overlapping)"

This might be true, but this could also be done by resampling to a single, fine global representation. I would assume that we can not overlay patches without some kind of smoothing at the boundary anyways. A single resolution might not be perfect, but would simplify our life. At least, we could get started with a single resolution and (if needed) allow a second regional file to be used as an overlay?

How do we represent uncertainties?

This is just another column of data, I would think.

Most data we will use as inputs is seismic in nature (Vs, Vp, density) whereas we need temperature and compositions. Where should the conversion happen?

That depends on the specific workflow, I would think. It is too early to tell what will be the best solution. If we can read/write arbitrary data fields, we should be good to go.

bangerth commented 4 years ago

I agree with XML vs HDF5. All I wanted to say is that both are self-describing file formats, and we should choose those.

Regarding sampling to the same resolution: We already have mesh refinement plugins that enforce substantially higher resolution in small parts of the domain. You want to have your initial conditions sampled to that resolution, but you may not be able to afford sampling everywhere to that resolution. For example, if you had a 10km resolution for the 1000x1000x500km under Cascadia (not crazy), then you would have to store >10^9 data points globally (too much to fit into memory on every processor). Let's not settle on a file format that we already know can't do what we want.

We could put each patch into a file of its own, if you prefer that and simply allow any number of such files -- but then at least in XML, the one-file-for-everything is simply the concatenation of the individual ones, and I suspect it's not more difficult in HDF5 either. But that I don't care about -- it may indeed be simpler to create multiple files, one per model on which an initial condition is based.

tjhei commented 4 years ago

We could put each patch into a file of its own, if you prefer that and simply allow any number of such files -- but then at least in XML, the one-file-for-everything is simply the concatenation of the individual ones,

hdf5 is basically a filesystem, you can store arbitrary many fields under a "path". We can easily store any number of variables and patches. We can also store metadata (like patch location, size, etc.) inside the hdf5 file or do it like xdmf where the metadata is stored in a separate text/xml file.

For example:

$ h5ls -f -v output-convection-box-3d/solution/mesh-00000.h5
/cells                   Dataset {12720/12720, 8/8}
    Location:  1:1400
    Links:     1
    Storage:   407040 logical bytes, 407040 allocated bytes, 100.00% utilization
    Type:      native unsigned int
/nodes                   Dataset {42930/42930, 3/3}
    Location:  1:800
    Links:     1
    Storage:   1030320 logical bytes, 1030320 allocated bytes, 100.00% utilization
    Type:      native double

I just realized that we could write the data as hdf5+xdmf with our fields as "uniform grids" and we can read them into paraview/visit to inspect. For each grid, you can specify the topology (bounding box, number of cells in each direction, etc.). In the hdf5 file these are simple 3 dimensional tables for each patch for each variable. I think this satisfies everything we need!

bangerth commented 4 years ago

In the currently ongoing CIG webinar, the speaker states that for the IRIS Earth Model Collaboration (EMC), the use netCDF as a way to represent earth property models. It might be interesting to see how that format is defined.

bangerth commented 4 years ago

Let's try to see whether we get Brandon Schmandt via @pythonlover8 :-)

maxrudolph commented 4 years ago

I noticed this thread while commenting on the mpi-io issues and have a couple of comments to add.

First @bangerth it is important to distinguish between data and models. Seismic tomography produces models that are constrained by data (e.g. waveforms, travel times). This may seem semantic, but it is not. To view a global or regional tomographic model as data is to ignore the underlying data constraints (data types and quality), theoretical framework, and assumptions about damping and regularization that were employed in the construction of a model of wavespeed variations. This is especially important if you are going to try to quantify or propagate uncertainty associated with the tomographic model used to initialize the geodynamic model. When I teach inverse theory, I usually start with curve fitting. If you're fitting a function of the form ax=b to observations {x_i,y_i}, the data are your x_i, y_i, and the model is the evaluation of f(x) = ax+b for the optimal coefficients a,b. The same is true in tomographic imaging - data are waveforms, traveltimes, frequencies of modes etc. The model is an optimal set of parameters subject to assumptions about for instance constitutive laws (e.g. isotropy or a model of anisotropy), the theory used for evaluating the forward model (e.g. ray theory, full waveform modeling), and extremely influential choices about damping and regularization.

From the perspective of someone who uses and analyzes global tomographic models, I think that it would be very helpful if there was a standardized set of routines for evaluating tomographic models using the same 1D model and parameterization for lateral wavespeed variations. Some groups have provided model evaluation tools that do just this - see for instance the a3d tool used to evaluate the UC Berkeley models semum and SEMUCB-WM1 and also the model evaluation program for S40RTS. Other models are only commonly released on regular grids (e.g. 2x2 degrees laterally and at a specified number of depths for S362ANI+M) but taking this model as an example, wavespeed variations are actually parameterized using 362 spherical splines in the lateral direction and cubic splines (with a discontinuity at 650 km depth) in the radial direction. The most elegant solution would be a set of classes with C++ and python bindings that could be used to provide accurate evaluations of wavespeed using the correct radial and lateral parameterizations and the corresponding 1D model. I think that the best tools available now for global models are Thorsten Becker's routines for working with models represented by spherical harmonic coefficients and the netcdf format used for distributing tomographic models e.g. through the IRIS DMC. However, the use of spherical harmonic expansions is not always appropriate for models that are not parameterized using spherical harmonics, and the radial parameterization is not faithful (we are left to assume piecewise constant or linear variations in the radial direction). The same limitations are true for models represented using regular grid (netcdf) file formats. Finally, I'll add that storing global tomographic models as regular grid data is not a great proposition because the grid spacing varies tremendously from pole to equator. A better approach is to use equispaced points on the sphere (e.g. the distribution of knots used for the spherical spline representation in Kustowski et al. 2008 or Moulik et al. 2014). This has also motivated the construction of unusual-seeming regular grids in rotated spherical coordinate systems for regional tomographic models e.g. of the North and South Atlantic by Colli et al. and Rickers at al.

I hope that this is useful food for thought.

bangerth commented 4 years ago

I'm German by origin, and so I live for semantic nitpicking :-) You're of course totally correct, and I was sloppy in my choice of words. Thanks for pointing it out!

I totally agree that lat/long/depth grids are, at least in principle, a poor choice. My motivation for these was primarily that they are rather cheap to evaluate because as tensor products, at worst you have to do three 1d searches (if the lat/long/depth nodes are not equally spaced along their respective coordinates) and at best you are done in O(1) if they are equidistant. In practice, you would want to start with a different model and sample it so that the resolution you need is achieved at the equator; at the poles, you get much higher (and unnecessarily fine) resolution -- that's annoying and costs memory, but at least keeps computations cheap.

That said, your note about rotated spherical coordinates makes me think that my idea of "patches" above may not have been so bad after all: One could subdivide the Earth mantle into 6 patches that are all parameterized differently with lat/long defined so that each of these patches doesn't contain the poles of their parameterization. Their respective "rotated" lat/long meshes would then be relatively uniformly spaced.

What I would like to avoid is the use of completely unstructured meshes, as well as ones with global basis functions (e.g. spherical harmonics). They may be best suited for compressed representations of the information, but are too expensive to evaluate.

I appreciate your links to other people's software. Maybe we should collect names and websites of people we could contact. I suspect that there are efforts to define at least regional models in some sort of generic way; I just don't know where to start looking. @tjhei found a couple of websites that would make for good starting points and that I'm sure he'll copy-paste here as well.

tjhei commented 4 years ago

@tjhei found a couple of websites that would make for good starting points and that I'm sure he'll copy-paste here as well.

None of these contain 3d data, as far as I can tell.

bangerth commented 4 years ago

On 3/4/20 9:09 AM, Timo Heister wrote:

None of these contain 3d data, as far as I can tell.

Do you know how they store 2d data on the sphere?

maxrudolph commented 4 years ago

You might ask Kasra Hosseini what they're doing on the backend for submachine. They have implemented a few really nice features such as the ability to re-reference a 3D model to a different 1D model. I would want to be able to verify the provenance of any of the models and see how they're being evaluated.

https://www.earth.ox.ac.uk/~smachine/cgi/index.php?page=tomo_depth

Re irregular grids, I think that it's not as bad a situation as you might think to evaluate scattered grid data, and it only needs to be done once at the beginning of a calculation if you're using this as an initial condition. The tomographic models are only prescribed in terms of a few hundred thousand parameters and most do not have global support and have the same structure at each depth. So, for each evaluation point, you only need to do one calculation to determine weight coefficients in the lateral direction and then figure out which depth slices have support at the local depth (super fast). I think that there has to be some value in doing things right and being faithful to the functional form of the velocity variations in the tomographic models.

tjhei commented 4 years ago

submachine has an overview of all of their supported models with sources here: https://www.earth.ox.ac.uk/~smachine/cgi/index.php?page=tomo_depth#table_tx2019slab_p

I will contact them to figure out what their underlying data format is.

Another data point: IRIS has a large collection of "earth models": http://ds.iris.edu/ds/products/emc-earthmodels/ with citations/references/etc. For example, it lists GyPSuM (http://ds.iris.edu/ds/products/emc-gypsum/) with the data itself and the metadata: http://ds.iris.edu/files/products/emc/emc-files/GYPSUMP_kmps.nc http://ds.iris.edu/ds/products/script/emcscript/meta_2.py?model=GYPSUMP_kmps.nc

This seems to be netcdf on a lat/long/depth grid with adaptive spacing in depth direction. Similar for other models.

pmbremner commented 4 years ago

netcdf has a number of tools to convert ascii to a netcdf binary format (and vice-versa). At least one of the available binary formats is hdf5

pmbremner commented 4 years ago

So I just got done speaking with Ray Russo across the hall about tomography model formats, and his experience is similar to mine where the vast majority of models have been passed around in ascii format. The main reasons for this are the ease of coding for ascii and that there hasn't really been a standard. IRIS has been converting contributed models into netcdf format on their end, perhaps because they are managing a database and binary formats take less space, or because netcdf is also compatible with plotting in Generic Mapping Tools (spelled out in case someone isn't familiar with GMT). On the data side (maybe model side too?) the active source community has been using an hdf5 based format, I believe called ph5 (https://github.com/PIC-IRIS/PH5/wiki).

Based on this experience, and after our group meeting today, talking with IRIS about why they use netcdf format and what tools they already have available to convert contributed model formats to netcdf would be very useful, as well as learning the underlying binary format they use, which may have evolved over time, and if they are all gridded formats. Their conversion tools are not widespread in the seismic community, and so there is an opportunity to help with that if we do ultimately use their format.

bangerth commented 3 years ago

I should have left a reference to GeoTess (https://www.sandia.gov/geotess/) here already when we talked about it. But it's never too late.

geodynamics / aspect

File format for describing complete set of initial conditions #3380