OHDSI / GIS

https://ohdsi.github.io/GIS
Apache License 2.0
8 stars 9 forks source link

Refactor standardizeStaged() to handle "filtering" R code from spec #98

Closed kzollove closed 1 year ago

kzollove commented 1 year ago

Operations that standardize R code, (included in the attr_spec JSON entry in a record of backbone.variable_source), may employ a filtering operation to extract an attribute from table:

"value_source_value": {"type": "code", "value": "dplyr::pull(dplyr::filter(staged, `Defining Parameter` =='CO'),'AQI')"}

The outer function dplyr::pull(x, 'AQI') is pulling the values from the AQI column. The inner function dplyr::filter(staged, `Defining Parameter` =='CO') defines the subset of the staged table to be used (i.e. only rows where Defining Parameter column has the value CO). The result of this function ends up in the attr_X column value_source_value

The issue is in the way that standardizeStaged() is currently written:

standardizeStaged <- function(staged, specTable) {
  selectRules <- specTable[specTable$t_type == "select",]
  stagedResult <- staged[,selectRules$t_value] %>% as.data.frame()
  names(stagedResult) <- selectRules$t_name

  hardcodeRules <- specTable[specTable$t_type == "hardcode",]
  if (nrow(hardcodeRules) > 0){
    for(i in 1:nrow(hardcodeRules)){
      stagedResult[hardcodeRules$t_name[i]] <- hardcodeRules$t_value[i]
    }
  }

  rCodeRules <- specTable[specTable$t_type == "code",]
  if (nrow(rCodeRules) > 0){
    for(i in 1:nrow(rCodeRules)){
      stagedResult[rCodeRules$t_name[i]] <- eval(parse(text=rCodeRules$t_value[i]))
    }
  }

  return(stagedResult)
}

where the second line of the function defines number of rows in the resulting stagedResult table. Trying to "assign" a new column as is done in stagedResult[rCodeRules$t_name[i]] <- eval(parse(text=rCodeRules$t_value[i])), simply doesn't work if its not the same number of rows/entries as stagedResult.

There are two possible solutions I see:

  1. Change the logic of standardizeStaged() so that instead of "assigning" columns that were sourced using R Code (i.e. "type": "code"), they are effectively "Right Joined" so that the number of rows in the stagedResult table matches the number of entries in the subset of the staged table.
  2. Add another option to the attr_spec, where "type": "filter" is used to signify a piece of filtering R code that would require the "Right join" as highlighted above. This seems to be changing to mechanisms to fix one problem. The only reason this would be preferable is if there is some instance where we specifically do not want the "type":"code" option to cause a join.

Do either of these solutions propose a foreseeable issue? Should a different approach be considered?

kzollove commented 1 year ago

While standardizeStaged() does need to change a bit, the simplest solution turns out to be:

1) Adding optional field to the attr_spec, stage_transform. This field can be used to pass R Code that alters the entire staged table. Its ideal use case is for filtering on one or more columns.

2) Handling the stage_transform commands in the standardizeStaged() function

kzollove commented 1 year ago

Newest iteration of this solution is:

  1. new attr_spec is a single element JSON (stage_transform) which is an array containing a series of R commands that operate iteratively on the staged source table. By the end of the commands, the staged table must contain the six required columns that go into stagedResult and eventually attr_X.
  2. standardizeStaged is now broken into standardizeStagedAttr and standardizeStagedGeom
  3. The logic from createSpecTable is now handled within standardizeStagedGeom. There is no specTable for standardizeStagedAttr
kzollove commented 1 year ago

Newest iteration of this solution is:

  1. Now attr_spec and geom_spec both follow the stage_transform format: an array containing a series of R commands that operate iteratively on the staged source table. By the end of the commands, the staged table must contain the required columns that go into stagedResult and eventually attr_X or geom_X.
  2. There is single, simplified standardizeStaged function that operates on staged attr or geom tables.
  3. The logic from createSpecTable is now unnecessary
kzollove commented 1 year ago

depends on #104 and #103

kzollove commented 1 year ago

Ready to be finished.