eco4cast / EFIstandards

Exploring possible metadata and data formatting standards for comparing Ecological Forecasts
BSD 2-Clause "Simplified" License

Additional metadata in attribute definition #17

Open ashiklom opened 4 years ago

ashiklom commented 4 years ago

Per the discussion today, were we looking for something like this? The general idea is that attributeDefinition has the format [variable_type]{Variable definition...}.

library(magrittr, include.only = "%>%")

attributes <- tibble::tribble(
  ~attributeName,      ~attributeDefinition,                               ~unit,                   ~formatString, ~numberType,  ~definition,
  "time",              "[dimension]{time}",                                "year",                  "YYYY-MM-DD",  "numberType", NA,
  "depth",             "[dimension]{depth in reservoir}",                  "meter",                 NA,            "real",       NA,
  "ensemble",          "[dimension]{index of ensemble member}",            "dimensionless",         NA,            "integer",    NA,
  "species_1",         "[statevariable]{Population density of species 1}", "numberPerMeterSquared", NA,            "real",       NA,
  "species_2",         "[statevariable]{Population density of species 2}", "numberPerMeterSquared", NA,            "real",       NA,
  "data_assimilation", "[flag]{Flag whether time step assimilated data}",  "dimensionless",         NA,            "integer",    NA
)
attributes
#> # A tibble: 6 x 6
#>   attributeName  attributeDefinition  unit    formatString numberType definition
#>   <chr>          <chr>                <chr>   <chr>        <chr>      <lgl>     
#> 1 time           [dimension]{time}    year    YYYY-MM-DD   numberType NA        
#> 2 depth          [dimension]{depth i… meter   <NA>         real       NA        
#> 3 ensemble       [dimension]{index o… dimens… <NA>         integer    NA        
#> 4 species_1      [statevariable]{Pop… number… <NA>         real       NA        
#> 5 species_2      [statevariable]{Pop… number… <NA>         real       NA        
#> 6 data_assimila… [flag]{Flag whether… dimens… <NA>         integer    NA

parse_attribute_definition <- function(string) {
  # Capture the [variable_type] and the {variable definition}
  regex <- "\\[(.*?)\\]\\{(.*?)\\}"
  m <- regexec(regex, string)
  result <- regmatches(string, m)
  # Drop the full-match column; drop = FALSE keeps a matrix for a single input
  output <- do.call(rbind, result)[, -1, drop = FALSE]
  colnames(output) <- c("variable_type", "variable_definition")
  output
}

parse_attribute_definition(attributes$attributeDefinition)
#>      variable_type   variable_definition                      
#> [1,] "dimension"     "time"                                   
#> [2,] "dimension"     "depth in reservoir"                     
#> [3,] "dimension"     "index of ensemble member"               
#> [4,] "statevariable" "Population density of species 1"        
#> [5,] "statevariable" "Population density of species 2"        
#> [6,] "flag"          "Flag whether time step assimilated data"

Created on 2020-09-15 by the reprex package (v0.3.0)

mdietze commented 4 years ago

Looks good to me. I think we'd just want a concrete list of the allowable variable_types. I think I'd add: driver, parameter, random_effect, observation, observation_error, process_error (obviously we'd update this list if we update the uncertainty list), and diagnostic (since @rqthomas mentioned this was useful in his files). Two (related) questions I'd have:

1. Do we need to have an initial_condition type and a statevariable type or are they always one and the same?
2. Is it possible for a single variable to have more than one type?

rqthomas commented 4 years ago

One case to consider is a flux (so it isn't a state) that is assimilated (so it isn't a diagnostic). This would fall through the classification cracks. Also, is there an easier regex to parse? I just use a colon ":" to separate the variable_type from the actual long name. However, what you present is cleaner to read, and if the average user isn't going to have to write complex regex statements, then I am fine with your proposal.
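
For comparison, the colon-separated convention mentioned here could be parsed roughly as follows. This is only a sketch: the helper name `parse_colon_definition()` and the "type:long name" layout are assumptions based on the one-sentence description above, not code taken from any existing file.

```r
# Hypothetical helper: split "variable_type:long name" on the FIRST colon only,
# so that any later colons stay in the long name.
parse_colon_definition <- function(string) {
  # Position of the first colon in each string, or -1 if there is none
  pos <- as.integer(regexpr(":", string, fixed = TRUE))
  has_colon <- pos > 0
  variable_type <- ifelse(
    has_colon, trimws(substr(string, 1, pos - 1)), NA_character_
  )
  variable_definition <- ifelse(
    has_colon, trimws(substr(string, pos + 1, nchar(string))), NA_character_
  )
  cbind(variable_type, variable_definition)
}

parse_colon_definition(c("dimension:time", "flag: ratio 1:2 flag"))
```

The trade-off is that a colon is easy to split on but is also a common character in free-text descriptions, which is why the sketch splits on the first colon only.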

ashiklom commented 3 years ago

> I think we'd just want a concrete list of the allowable variable_types.

Yup, this can be implemented as a factor, and we can throw errors if the result has any NAs.
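
A minimal sketch of that check. The `allowed_types` vector is an assumption assembled from the types mentioned so far in this thread (the final list is not settled), and `validate_variable_type()` is a hypothetical helper name.

```r
# Hypothetical list of allowed types, assembled from this thread; not final.
allowed_types <- c(
  "dimension", "statevariable", "flag", "driver", "parameter",
  "random_effect", "observation", "observation_error",
  "process_error", "diagnostic"
)

# Convert parsed types to a factor. Anything outside `levels` becomes NA,
# which is turned into an informative error.
validate_variable_type <- function(variable_type, levels = allowed_types) {
  result <- factor(variable_type, levels = levels)
  bad <- is.na(result)
  if (any(bad)) {
    stop(
      "Unknown variable_type(s): ",
      paste(unique(variable_type[bad]), collapse = ", "),
      call. = FALSE
    )
  }
  result
}

validate_variable_type(c("dimension", "flag"))       # returns a factor
try(validate_variable_type(c("dimension", "flux")))  # errors on "flux"
```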

> do we need to have an initial_condition type and a statevariable type or are they always one and the same?

I'm inclined to think they're the same, but I'm open to counterexamples.

> Is it possible for a single variable to have more than one type?

I think we should define our types to avoid this if at all possible (i.e., if this is possible, then we haven't defined our types well). From an implementation standpoint, there's no reason we couldn't implement multiple types with either [type1][type2]{description} or [type1|type2]{description} (or similar), but everything is simpler (conceptually and for implementation) if a variable can only have one type.
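
If multiple types were ever allowed, the `[type1|type2]{description}` variant could be handled by splitting the type field after it has been extracted. This is a sketch under that assumption only; `split_variable_types()` is a hypothetical name, and it expects the text between the square brackets as input.

```r
# Sketch only: split a "type1|type2" type field into its component types.
split_variable_types <- function(type_field) {
  # fixed = TRUE treats "|" literally rather than as regex alternation
  parts <- strsplit(type_field, "|", fixed = TRUE)
  lapply(parts, trimws)
}

# Returns a list with one character vector of types per input element
split_variable_types(c("statevariable", "statevariable|initial_condition"))
```

The result is a list rather than a plain character vector, which illustrates the extra complexity that single-typed variables avoid.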

> One case to consider is a flux (so it isn't a state) that is assimilated (so it isn't a diagnostic)

Even though it breaks ontologies, I'd probably be OK calling that a "state".

> Also, is there an easier regex to parse?

I picked this regex specifically for its parseability. As long as we define just a few simple rules (the [type] has to come first, no [] characters inside the type, and no characters after the description), the following should be pretty robust to just about any input. Note that the ? in the first .*? makes that quantifier non-greedy, so it will match the shortest string before a ] (rather than the default, which is greedy and will find the longest match; that could slurp up [] in the description). I also added a few " *" (a space followed by *) to make this robust to stray spaces.

"^ *\\[(.*?)\\] *\\{(.*)\\} *$"
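
A sketch of how this anchored regex might be wrapped in a function. The name `parse_variable()` follows the "parse_variable() or similar" wording in this thread, and returning NA for strings that do not follow the format is an assumption made for illustration.

```r
# Hypothetical wrapper around the anchored regex above.
parse_variable <- function(string) {
  regex <- "^ *\\[(.*?)\\] *\\{(.*)\\} *$"
  matches <- regmatches(string, regexec(regex, string))
  # Strings that do not follow the format yield zero-length matches;
  # fill those rows with NA so the output stays aligned with the input.
  rows <- lapply(matches, function(m) {
    if (length(m) == 3) m[-1] else c(NA_character_, NA_character_)
  })
  output <- do.call(rbind, rows)
  colnames(output) <- c("variable_type", "variable_definition")
  output
}

parse_variable(c(
  "[dimension]{time}",
  "  [flag] {Flag with [brackets] and {braces}}  ",
  "no type here"
))
```

The second input shows the case the rules are designed for: brackets and braces inside the description are kept as part of the definition, and only the leading [type] is treated as the type.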

> if the average user isn't going to have to write complex regex statements

Yeah, definitely not. The regex will be hard-coded in a parse_variable() or similar function in this package or elsewhere.