Column-oriented storage solutions for diagnostic data

Problem

One could argue that the diagnostic data is poorly suited for xarray/Zarr because the shape of each array is different across minimization loops and model runs. The number of observations, even within a variable, can differ from run to run and loop to loop. Because of the heterogeneity in the diagnostic data, we have to treat the data more like a table than like a multi-dimensional array.

For example, with model output, we can create an array with four dimensions: longitude, latitude, height, and time. These correspond to the grid cells produced by the model at each forecast time. Each cell should have a value because the model output is evenly distributed across this grid, and the same number of grid cells is present at each forecast time. The observations reported by the diagnostics for a model have no such regularity. The number of observations varies, the locations of some observation sites change, and the sites are not arranged in a regular pattern that can be represented as a grid.

As such, we end up treating the diagnostic data more like a table, where each row represents a single observation and attributes such as longitude, latitude, height, and time are like columns in the table. Because we aren’t able to define these columns as dimensions on the array, we aren’t able to effectively chunk the data. Instead, we create separate groups, each of which has to be iterated over and read separately into memory when we need it. Additionally, we are reading the entire array into memory when we only need a few columns.

Solution

There are a couple of column-oriented file formats available for data storage that are worth investigating: Feather and Parquet. Both are supported by pandas and should therefore be easy to work with in Python. pandas also appears to support reading from S3 using fsspec (which we are already using).

Moving to a more table-like storage format would allow us to concatenate the observations from minimization loops into a single data structure by adding a column representing the loop. We might even find it worthwhile to concatenate model runs by adding an initialization time column. We can then chunk by these columns as needed to speed up access by the application.
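As a rough sketch of the concatenation idea, the snippet below stacks two per-loop tables with different row counts into one table and tags each row with its loop. The loop labels ("ges", "anl"), the sample values, and the helper name combine_loops are hypothetical and only for illustration; obs_minus_forecast_adjusted is the one column name taken from our data.

```python
import pandas as pd

# Hypothetical per-loop diagnostics; the row counts differ on purpose.
guess = pd.DataFrame(
    {
        "longitude": [254.7, 255.1, 260.3],
        "latitude": [39.9, 40.0, 41.2],
        "obs_minus_forecast_adjusted": [0.8, -1.2, 0.3],
    }
)
analysis = pd.DataFrame(
    {
        "longitude": [254.7, 260.3],
        "latitude": [39.9, 41.2],
        "obs_minus_forecast_adjusted": [0.1, -0.2],
    }
)


def combine_loops(frames):
    """Stack per-loop tables into one table, tagging each row with its loop."""
    tagged = [df.assign(loop=name) for name, df in frames.items()]
    return pd.concat(tagged, ignore_index=True)


diagnostics = combine_loops({"ges": guess, "anl": analysis})

# Count the observations that came from each loop.
rows_per_loop = diagnostics.groupby("loop").size().to_dict()
```

The same pattern would presumably extend to model runs by adding an initialization time column before concatenating.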

It’s my understanding that a column-oriented store’s advantage is that it can load only the required columns into memory for processing. So, for example, using one of these stores, we should only have to load obs_minus_forecast_adjusted to produce visualizations, whereas with another type of store, we’d have to load all of the data just to read that one column.

[x] #431
[x] #450
[ ] Read model metadata from available parquet files instead of database
[ ] Remove writes to database

