darwin-eu / CodelistGenerator

Identifying relevant concepts from the OMOP CDM vocabularies
https://darwin-eu.github.io/CodelistGenerator/

Use a list of dplyr table references to create a unified interface to the vocabulary tables #32

Closed ablack3 closed 2 years ago

ablack3 commented 2 years ago

getCandidateCodes takes either a database connection or a path to a directory containing the vocabulary tables. I would like to propose introducing a "vocabulary reference" object: a list of Arrow Tables or dplyr table references pointing to a remote database. This object would work with dplyr verbs, and getCandidateCodes would accept a single "data" argument that is a vocabulary reference.

This also means the vocabulary tables can be validated once, when the vocabulary reference object is created, instead of each time getCandidateCodes runs.
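To make the "validate once" point concrete, here is a minimal base R sketch. The names newVocabReference and countConcepts are hypothetical and exist only for this illustration; plain data frames stand in for Arrow Tables or database table references.

```r
# Hypothetical constructor: checks the table list once, then tags it with a class.
newVocabReference <- function(tables) {
  required <- c("concept", "concept_ancestor", "concept_synonym",
                "concept_relationship", "vocabulary")
  missingTables <- setdiff(required, names(tables))
  if (length(missingTables) > 0) {
    stop("Missing vocabulary tables: ", paste(missingTables, collapse = ", "))
  }
  structure(tables[required], class = "VocabReference")
}

# Hypothetical downstream function: it only needs a cheap class check,
# because the expensive validation already happened in the constructor.
countConcepts <- function(vocab) {
  stopifnot(inherits(vocab, "VocabReference"))
  nrow(vocab$concept)
}

# Toy data: only the concept table has content here.
emptyTable <- data.frame()
vocab <- newVocabReference(list(
  concept = data.frame(concept_id = c(1L, 2L),
                       concept_name = c("Asthma", "Cough")),
  concept_ancestor = emptyTable,
  concept_synonym = emptyTable,
  concept_relationship = emptyTable,
  vocabulary = emptyTable
))

countConcepts(vocab)
```

Any function that receives the object can trust its class instead of repeating the table checks.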

A draft implementation for creating a vocabulary reference object could look like this.


#' Create a list of references to remote OMOP vocab tables
#'
#' @param con A database connection
#' @param schema The schema where the vocab tables are located. Defaults to NULL.
#'
#' @return A list of dplyr database table references pointing to 5 vocabulary tables
#' @export
vocabRefFromDatabase <- function(con, schema = NULL) {

  checkmate::assertClass(con, "DBIConnection")
  checkmate::assertCharacter(schema, null.ok = TRUE)

  vocabTableNames <- c("concept", "concept_ancestor", "concept_synonym", "concept_relationship", "vocabulary")

  if (!is.null(schema)) {
    vocab <- purrr::map(vocabTableNames, ~dplyr::tbl(con, dbplyr::in_schema(schema, .)))
  } else {
    vocab <- purrr::map(vocabTableNames, ~dplyr::tbl(con, .))
  }

  vocab %>%
    magrittr::set_names(vocabTableNames) %>%
    magrittr::set_class("VocabReference") %>%
    assertVocabColumnNames()
}

#' Create a list of vocab tables from a directory of parquet files
#'
#' @param path Directory containing 5 vocabulary parquet files:
#' "concept", "concept_ancestor", "concept_synonym", "concept_relationship", "vocabulary" all with
#' the .parquet extension
#'
#' @return A list of 5 Arrow Tables
#' @export
vocabRefFromFiles <- function(path) {
  checkmate::assertCharacter(path, len = 1)

  vocabTableNames <- c("concept", "concept_ancestor", "concept_synonym", "concept_relationship", "vocabulary")
  vocabPaths <- file.path(path, paste0(vocabTableNames, ".parquet"))

  checkmate::assertTRUE(file.exists(path))
  checkmate::assertTRUE(all(purrr::map_lgl(vocabPaths, file.exists)))

  purrr::map(vocabPaths, arrow::read_parquet, as_data_frame = FALSE) %>%
    magrittr::set_names(vocabTableNames) %>%
    magrittr::set_class("VocabReference") %>%
    assertVocabColumnNames()
}

# Check that the reference holds the expected tables and columns.
# Column names are read with colnames(): on a dplyr database table reference,
# names() returns the fields of the underlying list, not the column names.
assertVocabColumnNames <- function(vocabReference) {

  checkmate::assertSetEqual(names(vocabReference),
    c("concept", "concept_ancestor", "concept_synonym", "concept_relationship", "vocabulary"))

  checkmate::assertSetEqual(colnames(vocabReference$concept),
    c("concept_id", "concept_name", "domain_id", "vocabulary_id", "standard_concept"))

  checkmate::assertSetEqual(colnames(vocabReference$concept_ancestor),
    c("ancestor_concept_id", "descendant_concept_id", "min_levels_of_separation", "max_levels_of_separation"))

  checkmate::assertSetEqual(colnames(vocabReference$concept_synonym),
    c("concept_id", "concept_synonym_name"))

  checkmate::assertSetEqual(colnames(vocabReference$concept_relationship),
    c("concept_id_1", "concept_id_2", "relationship_id"))

  return(vocabReference)
}
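As a rough usage sketch of the database-backed case, the snippet below builds a named list of lazy table references against an in-memory SQLite database and queries it with dplyr verbs. The SQLite backend, the toy rows, and the reduced set of two tables are assumptions made for illustration; a real OMOP vocabulary would sit in a full database schema.

```r
library(dplyr)

# In-memory SQLite database standing in for a real OMOP vocabulary schema.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "concept", data.frame(
  concept_id = c(1L, 2L, 3L),
  concept_name = c("Respiratory disorder", "Asthma", "Allergic asthma"),
  domain_id = "Condition"
))
DBI::dbWriteTable(con, "concept_ancestor", data.frame(
  ancestor_concept_id = c(1L, 1L, 1L, 2L, 2L, 3L),
  descendant_concept_id = c(1L, 2L, 3L, 2L, 3L, 3L)
))

# A named list of lazy table references (two tables only, to keep this short).
tableNames <- c("concept", "concept_ancestor")
vocab <- lapply(tableNames, function(x) dplyr::tbl(con, x))
names(vocab) <- tableNames

# colnames() returns the column names of a lazy table.
conceptColumns <- colnames(vocab$concept)

# dplyr verbs build SQL that runs in the database; collect() fetches the result.
asthmaDescendants <- vocab$concept_ancestor %>%
  filter(ancestor_concept_id == 2L) %>%
  inner_join(vocab$concept, by = c("descendant_concept_id" = "concept_id")) %>%
  select(descendant_concept_id, concept_name) %>%
  arrange(descendant_concept_id) %>%
  collect()

DBI::dbDisconnect(con)
```

Nothing is pulled into R until collect() is called, which is what makes a list of table references practical for large vocabularies.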

The getCandidateCodes interface would look like

getCandidateCodes <- function(vocref,
                              keywords,
                              exclude = NULL,
                              domains = "Condition",
                              conceptClassId = NULL,
                              standardConcept = "Standard",
                              searchSynonyms = FALSE,
                              searchNonStandard = FALSE,
                              fuzzyMatch = FALSE,
                              maxDistanceCost = 0.1,
                              includeDescendants = TRUE,
                              includeAncestor = FALSE,
                              verbose = FALSE) {
  # function body omitted in this proposal
}
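To show how a function taking a single vocabulary reference might use it, here is a much-simplified sketch. searchConcepts is a hypothetical name and is not the package's getCandidateCodes; it handles one keyword and one domain, and the toy reference is built from data frames with only the two tables the sketch needs.

```r
library(dplyr)

# Hypothetical, much-simplified search over a vocabulary reference.
searchConcepts <- function(vocab, keyword, domain = "Condition",
                           includeDescendants = TRUE) {
  stopifnot(inherits(vocab, "VocabReference"))
  pattern <- tolower(keyword)

  # Filter by domain where the data lives, then match the keyword locally,
  # so the same code runs on data frames and on remote table references.
  candidates <- vocab$concept %>%
    filter(domain_id == !!domain) %>%
    collect() %>%
    filter(grepl(pattern, tolower(concept_name), fixed = TRUE))

  if (includeDescendants) {
    matchedIds <- candidates$concept_id
    # Assumes concept_ancestor also lists each concept as its own descendant.
    descendantIds <- vocab$concept_ancestor %>%
      filter(ancestor_concept_id %in% !!matchedIds) %>%
      pull(descendant_concept_id)
    candidates <- vocab$concept %>%
      filter(concept_id %in% !!descendantIds) %>%
      collect()
  }
  arrange(candidates, concept_id)
}

# Toy reference built from data frames.
vocab <- structure(
  list(
    concept = data.frame(
      concept_id = c(1L, 2L, 3L, 4L),
      concept_name = c("Respiratory disorder", "Asthma",
                       "Allergic asthma", "Salbutamol"),
      domain_id = c("Condition", "Condition", "Condition", "Drug")
    ),
    concept_ancestor = data.frame(
      ancestor_concept_id = c(1L, 1L, 1L, 2L, 2L, 3L, 4L),
      descendant_concept_id = c(1L, 2L, 3L, 2L, 3L, 3L, 4L)
    )
  ),
  class = "VocabReference"
)

asthmaCodes <- searchConcepts(vocab, "asthma")
respiratoryOnly <- searchConcepts(vocab, "respiratory", includeDescendants = FALSE)
```

The caller passes one object instead of a connection or a directory path, and the function body does not need to know which backend holds the tables.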

If this seems like a good idea, what should we call such objects? Maybe vocref, vocab, vocabReference, or something else?

I'd also like to explore extending this to the CDM, so that we could have CDM reference objects: lists of table references to a CDM.

@edward-burn

edward-burn commented 2 years ago

@ablack3 Yes, I think this would be much nicer than how we currently do this (both here when referencing the vocabulary tables and elsewhere where we reference cdm tables with patient data). Because it seems like similar functionality would be nice across various analytic packages, maybe your suggested approach could be in its own package? What do you think?

edward-burn commented 2 years ago

@ablack3 as discussed, let's try to incorporate this into a separate package that can become a dependency of CodelistGenerator. When you have a GitHub repo set up for that, please transfer this issue there.

edward-burn commented 2 years ago

Closing as we now have the CDMConnector package