NOAA-PMEL / Ferret

The Ferret program from NOAA/PMEL
https://ferret.pmel.noaa.gov/Ferret/
The Unlicense

Subsetting DSG datasets #1951

Open AnsleyManke opened 4 years ago

AnsleyManke commented 4 years ago

From Hankin notes in https://docs.google.com/document/d/19i-fbyA3XvPkwNp5lxXlnliIBiKwYDlZhz4B_oHKMGk/edit

[see also the SET DATA/FMASK capability implemented after these notes were written. It may give us the piece of this that we need for storing the mask.]

The currently implemented code reads entire DSG variables into memory; it gives the illusion of handling subsets, but in memory a DSG variable object always occupies the full size of the DSG ragged array. Here we explore the concept of a “subset dataset” as a way to reduce memory usage.

Sample Session - Subset Dataset

! open the dataset as usual
  yes? USE giant_dsg.nc

! define the desired subset
  yes? DEFINE DATA/PARENT=giant_dsg mySubset = "<constraint expression>"

! access a subsetted variable, using reduced memory
  yes? PLOT temp

Under the Hood: Subset Dataset

yes? USE giant_dsg.nc

No change from opening any other DSG dataset. Only the coordinate variables are read at this point, but they are read at full size, which would make this simple approach unsuitable for tera-scale DSG datasets.

yes? DEFINE DATA/PARENT=giant_dsg mySubset = "<constraint expression>"

Steps to implement:

Use the logic of the “LET/D=” command to store the definition of the mySubset constraint expression in the parent dataset.

Evaluate the constraint expression. If the expression involves DSG variables other than the coordinates, this will trigger reading of those variables. This cannot be avoided.

Check that the resulting expression is a mask - zeros and ones. The resulting mask should be an Instance variable, rather than an Observations variable. This avoids a lot of complexity and confusion. (Observations-level masking can still be applied later.)

A cute addition would be to allow (e.g.) mySubset = "{5, 7, 9}", giving the user a tool to pluck out individual features by E number. It would be simple to create the equivalent mask from this.

Create a new dataset (the subset).

Copy the parent variables into it in XDSET_INFO.

Store the subset mask in linemem with a new variable dsg_subset_mask_lm pointing to it.

Store subsetted rowSize and coordinate variables in linemem by applying the mask to the parent coordinate variables.

Create the feature and orientation axes of this subset dataset (variations on the cd_dsg*.F routines).
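The mask-related steps above can be sketched outside of Ferret as follows. This is only an illustration in Python with NumPy, not the Fortran that would live in the cd_dsg*.F routines. The function names (validate_instance_mask, mask_from_feature_numbers, subset_dsg) are hypothetical, and the sketch assumes a contiguous ragged array in which rowSize gives the number of observations per feature and E numbers are 1-based.

```python
import numpy as np


def validate_instance_mask(mask, n_features):
    """Return the mask as booleans, or raise if it is not a 0/1 instance mask."""
    mask = np.asarray(mask)
    if mask.shape != (n_features,):
        raise ValueError("mask must have one value per feature (instance-level)")
    if not np.all((mask == 0) | (mask == 1)):
        raise ValueError("mask must contain only zeros and ones")
    return mask.astype(bool)


def mask_from_feature_numbers(feature_numbers, n_features):
    """Build an instance mask from 1-based feature (E) numbers such as {5, 7, 9}."""
    idx = np.asarray(sorted(set(feature_numbers)), dtype=int)
    if idx.size and (idx[0] < 1 or idx[-1] > n_features):
        raise ValueError("feature number out of range")
    mask = np.zeros(n_features, dtype=bool)
    mask[idx - 1] = True  # convert 1-based E numbers to 0-based positions
    return mask


def subset_dsg(row_size, obs_coord, mask):
    """Apply an instance mask to rowSize and to one observation-level coordinate."""
    row_size = np.asarray(row_size)
    obs_coord = np.asarray(obs_coord)
    # Expand the per-feature mask to one flag per observation.
    obs_mask = np.repeat(mask, row_size)
    return row_size[mask], obs_coord[obs_mask]
```

Keeping the mask at the instance level means the subsetted rowSize is just the parent rowSize indexed by the mask, and the observation-level selection follows from it by repetition.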

yes? PLOT temp

Enhance cd_dsg_read.F so it recognizes a subset dataset, and reads the subsetted ragged array as a series of chunks, guided by the stored mask. Some optimization might be desirable to minimize the number of individual reads. The parent rowSize provides guidance to trade off the number of I/O requests against data volume.
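One possible way to plan the chunked reads is sketched below in Python. It is not the cd_dsg_read.F logic. The function name plan_reads and the max_gap parameter are hypothetical: max_gap is an assumed tuning knob giving the largest run of unselected observations that is read anyway in order to merge two neighbouring requests.

```python
def plan_reads(row_size, mask, max_gap=0):
    """Return (start, count) observation ranges covering all selected features.

    Ranges separated by at most max_gap unselected observations are merged,
    trading extra data volume for fewer I/O requests.
    """
    reads = []
    start = 0  # offset of the current feature in the parent ragged array
    for n, keep in zip(row_size, mask):
        if keep and n > 0:
            if reads and start - (reads[-1][0] + reads[-1][1]) <= max_gap:
                # Extend the previous read across the gap.
                reads[-1][1] = start + n - reads[-1][0]
            else:
                reads.append([start, n])
        start += n
    return [tuple(r) for r in reads]
```

With max_gap=0 only adjacent selected features are coalesced, so no unwanted data is read. A larger max_gap reduces the number of requests, but the reader then has to discard the unselected observations that fall inside a merged range.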