OHDSI / FeatureExtraction

An R package for generating features (covariates) for a cohort using data in the Common Data Model.
http://ohdsi.github.io/FeatureExtraction/

Solution to replace default Table 1 specs #92

Closed gowthamrao closed 4 years ago

gowthamrao commented 4 years ago

Table 1 specs is currently fixed. There is no simple way to replace the standard Table 1 specs. This functionality is needed, as the default may not be relevant for all cohorts.

We need a solution to do a simple replace of the default Table1Specs. This solution needs to be documented.

Continued from here https://github.com/OHDSI/CohortDiagnostics/issues/32#issuecomment-611865136

The solution should be available as a function call for any downstream package that uses FeatureExtraction's createTable1 function.

gowthamrao commented 4 years ago

Also related to https://github.com/OHDSI/FeatureExtraction/issues/84 https://github.com/OHDSI/FeatureExtraction/issues/85 https://github.com/OHDSI/FeatureExtraction/issues/87

gowthamrao commented 4 years ago

There are three fields in Table 1 specs:

  1. Covariate ids: may be populated using ROhdsiWebApi::getConceptSetConceptIds, separated by ";". @schuemie do we want the option for commas as separators? CovariateId = (conceptId * 1000) + analysisId

  2. Analysis Ids: Pre-specified analysis ids are here and here (temporal). @schuemie do we want the ability to introduce custom analysis ids? E.g. if we want to support 'feature cohorts', i.e. we use another cohort as an input for feature calculation.

  3. Label: meaningful name to classify the output of table 1
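
As a minimal sketch of the covariate ID convention mentioned in item 1, the snippet below builds the string that would go into the covariateIds field. The concept IDs, the analysis ID, and the comma separator are illustrative assumptions, not values taken from the package.

```r
# Convention quoted in the thread: covariateId = (conceptId * 1000) + analysisId.
# The concept IDs and analysis ID below are illustrative values only.
conceptIds <- c(201820, 316866)
analysisId <- 210
separator <- ","  # assumed separator for the covariateIds field

# Multiplying by the double 1000 avoids 32-bit integer overflow for large concept IDs
covariateIds <- conceptIds * 1000 + analysisId

# format(..., scientific = FALSE) guards against IDs being printed as e.g. 2e+08
covariateIdString <- paste(format(covariateIds, scientific = FALSE, trim = TRUE),
                           collapse = separator)
print(covariateIdString)
```

The formula only applies to analyses whose covariates are tied to a single concept; other covariates need their IDs looked up rather than computed.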

gowthamrao commented 4 years ago

For 1. Covariate ids: Steps:

Resolve the concept set expression with ROhdsiWebApi::getConceptSetConceptIds. This may return conceptIds from various domains, so given a list of conceptIds we need their details from the concept table in order to classify them by domainId. We need a function in ROhdsiWebApi that accepts a list of conceptIds and returns a data frame from the concept table. This would be a POST call using this end-point.
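
The classification step could look roughly like the sketch below. It assumes the concept details have already been retrieved into a data frame; `conceptTable` and `resolvedConceptIds` are hypothetical stand-ins, and no WebAPI call is made.

```r
# Hypothetical stand-in for rows retrieved from the CDM concept table
conceptTable <- data.frame(
  conceptId = c(201820, 316866, 1503297),
  domainId = c("Condition", "Condition", "Drug"),
  stringsAsFactors = FALSE
)

# Hypothetical result of resolving a concept set expression
resolvedConceptIds <- c(201820, 1503297)

# Keep only the resolved concepts, then group their IDs by domain
resolved <- conceptTable[conceptTable$conceptId %in% resolvedConceptIds, ]
conceptIdsByDomain <- split(resolved$conceptId, resolved$domainId)
print(conceptIdsByDomain)
```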

schuemie commented 4 years ago

I disagree with your assessment that "Table 1 specs is currently fixed". I deliberately made it customizable, by replacing the specifications argument when calling the function.
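
A hand-written specification could be sketched as follows. The three column names are the ones the thread lists for Table 1 specs; the labels and IDs are illustrative assumptions, and the createTable1 call is commented out because it needs FeatureExtraction and an existing CovariateData object (here given the hypothetical name `covariateData`).

```r
# Custom Table 1 specifications with the three fields discussed in this thread.
# Labels and IDs are illustrative, not a recommended selection.
customSpecifications <- data.frame(
  label = c("Gender: female", "Medical history"),
  analysisId = c(1, 210),
  covariateIds = c("8532001", "201820210,316866210"),
  stringsAsFactors = FALSE
)
print(customSpecifications)

# Not run: requires FeatureExtraction and a CovariateData object named covariateData
# table1 <- FeatureExtraction::createTable1(covariateData1 = covariateData,
#                                           specifications = customSpecifications)
```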

The problem is that it takes quite a lot of knowledge to know how to modify the specifications to your liking. Remember, the specifications represent a hand-picked selection of covariates to show. I don't think we need (or can) automate anything here.

The biggest challenge is explaining (and / or facilitating) covariate IDs and analysis IDs. Covariate IDs are not the same as concept IDs, although for most analyses we made it so one can easily be derived from the other. Analysis IDs are not clearly documented anywhere, although the standard ones are listed in this hard-to-read CSV file. But we should also support custom covariates, which also will have covariate IDs and analysis IDs.

I recommend we instruct people to use a CovariateData object as the starting point for creating their own table 1 specs. The CovariateData object has covariateRef and analysisRef objects that detail all the covariates and analyses. We might even create a small editor (say, in Shiny) that, using a CovariateData object as input, allows users to pick analyses and covariates. But to me that seems overkill. A good vignette documenting how to create table 1 specs is probably sufficient.

A note of warning: the implementation of FeatureExtraction is about to change drastically with the replacing of ff.

gowthamrao commented 4 years ago

> The problem is that it takes quite a lot of knowledge to know how to modify the specifications to your liking. Remember, the specifications represent a hand-picked selection of covariates to show. I don't think we need (or can) automate anything here.

I will try to write a function to support this important need, something like modifyTable1Specifications. After receiving your feedback on that function, I will then tackle:

> I recommend we instruct people to use a CovariateData object as the starting point for creating their own table 1 specs. The CovariateData object has covariateRef and analysisRef objects that detail all the covariates and analyses. We might even create a small editor (say, in Shiny) that, using a CovariateData object as input, allows users to pick analyses and covariates. But to me that seems overkill. A good vignette documenting how to create table 1 specs is probably sufficient.

I will post this function on this thread.

gowthamrao commented 4 years ago

@schuemie here is my untested code. If you think this is a good approach, I will test it and open a PR.

#' Modify table 1 specifications
#'
#' @description
#' Modifies table 1 specifications that follow the structure returned by \code{\link{getDefaultTable1Specifications}}.
#' The output of this function may be used with \code{\link{createTable1}} in place of the default table 1 specifications.
#' Note: Rows in table 1 specifications are uniquely identified by the combination of label and analysisId. The analysisId should be
#' one of the analysisIds listed in 'PrespecAnalyses.csv' in the installed package. CovariateIds are calculated as
#' (conceptId * 1000) + analysisId.
#'  
#' @param deleteAnalysisId                          An integer representing the analysisId to be deleted.
#' @param deleteLabel                               A character string representing the label to be deleted.
#' @param insertAnalysisId                          An integer representing the analysisId to be inserted.
#' @param insertLabel                               A character string representing the label to be inserted.
#' @param insertConceptIdConceptNameDf              A data frame with at least two columns: conceptId (integer) and domainId (character),
#'                                                  as in the OMOP CDM concept table. The domainId is matched against the domainId expected
#'                                                  by the analysisId in PrespecAnalyses.csv in the installed package.
#' @param table1Specifications                      Specifications of which covariates to display, and how.
#' @param seperatorForCovariateIds                  (optional) How covariateId values should be separated in the covariateIds field.
#'                                                  Default = ",".
#' @return                                          A tibble specifications object.
#' @export
modifyTable1Specifications <- function(table1Specifications = FeatureExtraction::getDefaultTable1Specifications(), 
                                       deleteAnalysisId = NULL,
                                       deleteLabel = NULL,
                                       insertAnalysisId = NULL,
                                       insertLabel = NULL,
                                       insertConceptIdConceptNameDf = NULL,
                                       seperatorForCovariateIds = ",") {

  table1Specifications <- tidyr::as_tibble(table1Specifications)

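  # all.equal() also copes with differing dimensions, where an element-wise `==` would error
  if (isTRUE(all.equal(as.data.frame(table1Specifications),
                       as.data.frame(FeatureExtraction::getDefaultTable1Specifications()),
                       check.attributes = FALSE))) {
    message("Starting with the default table 1 specifications from FeatureExtraction to create modified table 1 specifications.")
  }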

  fileName <- system.file("csv", "PrespecAnalyses.csv", package = "FeatureExtraction")
  prespecAnalyses <- readr::read_csv(file = fileName, col_types = readr::cols())

  ## begin error checks
  errorMessage <- checkmate::makeAssertCollection()
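  # Assumption: covariateIds can be empty (NA) for some rows (the insert step below turns "" into NA),
  # so missing values are allowed here
  checkmate::assertDataFrame(x = table1Specifications, 
                             any.missing = TRUE,
                             null.ok = FALSE,
                             add = errorMessage)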
  checkmate::assertDataFrame(x = prespecAnalyses, 
                             any.missing = FALSE,
                             null.ok = FALSE,
                             .var.name = "(PrespecAnalyses.csv installed in Feature Extraction package)",
                             add = errorMessage)
  checkmate::assertNames(x = names(table1Specifications),
                         type = "unique",
                         must.include = c("label","analysisId","covariateIds"),
                         what = "colnames",
                         add = errorMessage)

  if (any(!is.null(deleteAnalysisId) | !is.null(deleteLabel))) {
    checkmate::assertCharacter(x = deleteLabel,
                               any.missing = FALSE,
                               len = 1,
                               null.ok = FALSE,
                               add = errorMessage)
    checkmate::assertInteger(x = deleteAnalysisId,
                             any.missing = FALSE,
                             len = 1,
                             null.ok = FALSE,
                             add = errorMessage)
    checkmate::assertNames(x = deleteLabel,
                           type = "unique",
                           subset.of = table1Specifications$label %>% unique(),
                           add = errorMessage)
    checkmate::assertNames(x = deleteAnalysisId %>% as.character(),
                           type = "unique",
                           subset.of = table1Specifications$analysisId %>% unique() %>% as.character(),
                           add = errorMessage)
  }
  if (any(!is.null(insertAnalysisId) | !is.null(insertLabel) | !is.null(insertConceptIdConceptNameDf))) {
    checkmate::assertCharacter(x = insertLabel,
                               any.missing = FALSE,
                               len = 1,
                               null.ok = FALSE,
                               add = errorMessage)
    checkmate::assertInteger(x = insertAnalysisId,
                             any.missing = FALSE,
                             len = 1,
                             null.ok = FALSE,
                             add = errorMessage)
    checkmate::assertNames(x = insertAnalysisId %>% as.character(),
                           type = "unique",
                           subset.of = prespecAnalyses$analysisId %>% unique() %>% as.character(),
                           add = errorMessage)
    checkmate::assertDataFrame(x = insertConceptIdConceptNameDf,
                               any.missing = FALSE,
                               min.rows = 1,
                               min.cols = 2,
                               add = errorMessage)
    # domainId (not conceptName) is the column the insert step below filters on
    checkmate::assertNames(x = names(insertConceptIdConceptNameDf),
                           type = "unique",
                           must.include = c("conceptId", "domainId"),
                           add = errorMessage)
  }
  # Report all collected assertion failures at once (stops if any check failed)
  checkmate::reportAssertions(collection = errorMessage)
  ### end of error checks

  if (any(!is.null(deleteAnalysisId) | !is.null(deleteLabel))) {
    # Drop the row(s) matching both the label and the analysisId to delete
    table1Specifications <- table1Specifications %>% 
      dplyr::anti_join(dplyr::tibble(label = deleteLabel, analysisId = deleteAnalysisId),
                       by = c("label", "analysisId"))
  }

  if (any(!is.null(insertAnalysisId) | !is.null(insertLabel) | !is.null(insertConceptIdConceptNameDf))) {
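      # domainId expected by the chosen pre-specified analysis
      expectedDomainId <- prespecAnalyses %>% 
        dplyr::filter(analysisId == insertAnalysisId) %>% 
        dplyr::pull(domainId)

      # keep only the concepts whose domain matches that analysis;
      # %in% stays well-defined if zero or several domainIds are returned
      listConceptIds <- insertConceptIdConceptNameDf %>% 
        dplyr::filter(domainId %in% expectedDomainId) %>% 
        dplyr::pull(conceptId) %>% 
        unique()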

      insertTable1Specifications <- dplyr::tibble(label = insertLabel, 
                      analysisId = insertAnalysisId, 
                      covariateIds = paste((listConceptIds*1000)+insertAnalysisId, collapse = seperatorForCovariateIds)
        )

      # A full join appends a new label/analysisId combination as a new row;
      # a left join would silently drop it.
      table1Specifications <-  
        dplyr::full_join(x = table1Specifications, 
                         y = insertTable1Specifications, 
                         by = c("label" = "label", "analysisId" = "analysisId")) %>%
        # unite() with na.rm = TRUE skips missing values, so no "NA" strings need cleaning up
        tidyr::unite(col = "covariateIds", 
                     covariateIds.x, 
                     covariateIds.y, 
                     sep = seperatorForCovariateIds, 
                     na.rm = TRUE) %>% 
        # de-duplicate IDs within each row; row order of the table is preserved
        dplyr::mutate(covariateIds = vapply(
          strsplit(covariateIds, split = seperatorForCovariateIds, fixed = TRUE),
          function(ids) paste(unique(ids), collapse = seperatorForCovariateIds),
          character(1))) %>% 
        dplyr::mutate(covariateIds = dplyr::na_if(covariateIds, "")) %>% 
        dplyr::select(label, analysisId, covariateIds)
  }
  return(table1Specifications)
}

schuemie commented 4 years ago

Do you think this function will make it easier to edit the specifications compared to just letting people edit the text file? It is also not clear how you specify covariate IDs (which are not the same as concept IDs)

gowthamrao commented 4 years ago

Yes. I think it will be very useful: it makes editing easier and also less error-prone (because of the checks, deduplication, and matching of domainId to analysisId) compared to the external file.

Here are my assumptions:

AnalysisId: analysisIds are part of the file located via system.file, i.e. installed with the package. An advanced user may change analysisIds using packageMaintenance. But the analysisId has to be installed and pre-specified prior to calling this function, i.e. this function does not add/delete/manage analysisIds.

CovariateId: Yes, this function supports creation of CovariateIds that are (conceptId * 1000) + analysisId: `covariateIds = paste((listConceptIds*1000)+insertAnalysisId, collapse = seperatorForCovariateIds)`. As written right now, it does not support CovariateIds like that of the Charlson comorbidity index (1901), but I think we could add that functionality too.
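
To illustrate why an ID such as 1901 falls outside this scheme, here is a small sketch that reverses the formula. The first two covariate IDs are illustrative; 1901 is the Charlson example from the comment above.

```r
# Reverse of covariateId = (conceptId * 1000) + analysisId
covariateIds <- c(201820210, 8532001, 1901)

decoded <- data.frame(
  covariateId = covariateIds,
  conceptId = covariateIds %/% 1000,   # integer division recovers the concept part
  analysisId = covariateIds %% 1000    # remainder recovers the analysis part
)
print(decoded)
```

For 1901 the "concept part" is just 1, which is not a meaningful concept ID, so such covariates cannot be generated from a list of concept IDs and would have to be specified directly.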

schuemie commented 4 years ago

Why not work from an existing CovariateData object, where we have all the covariate IDs and analysis IDs, and can just pick from those?

gowthamrao commented 4 years ago

I did not study the CovariateData object, so I don't know. Does using a CovariateData object allow specifying the table 1 label/structure?

schuemie commented 4 years ago

The CovariateData object (the result from getDbCovariateData) has a covariateRef table and an analysisRef table. Together they specify all the covariates and analyses that were generated. If you're hand-picking covariates and analyses to show, that would be the most convenient source. It even includes any custom covariates people may have constructed. There is no need to try to figure out which analysis is what, or what logic to use to construct covariate IDs.
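
A rough sketch of that workflow, using a plain data frame as a hypothetical stand-in for the covariateRef table (the real object's storage and exact columns may differ, and the covariate names and labels here are made up):

```r
# Hypothetical stand-in for the covariateRef table of a CovariateData object
covariateRef <- data.frame(
  covariateId = c(8532001, 201820210, 316866210, 1901),
  covariateName = c("gender = FEMALE", "condition A", "condition B", "Charlson index"),
  analysisId = c(1, 210, 210, 901),
  stringsAsFactors = FALSE
)

# Hand-picked covariates, each with the Table 1 label it should appear under
picked <- data.frame(
  covariateId = c(201820210, 316866210, 1901),
  label = c("Medical history", "Medical history", "Charlson comorbidity index"),
  stringsAsFactors = FALSE
)

# Look up the analysisId of each picked covariate instead of deriving it
selected <- merge(picked, covariateRef, by = "covariateId")

# One specification row per (label, analysisId), with the IDs collapsed into a string
specifications <- aggregate(
  covariateId ~ label + analysisId,
  data = selected,
  FUN = function(ids) paste(format(ids, scientific = FALSE, trim = TRUE), collapse = ",")
)
names(specifications)[names(specifications) == "covariateId"] <- "covariateIds"
print(specifications)
```

Because the IDs are read from the reference table, custom covariates and IDs like 1901 are handled the same way as concept-based ones.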