Separate scripts for correcting typos and renaming domains

IALSA / IALSA-2015-Portland

Hub to accompany IALSA 2015 workshop at Portland, OR, Feb 22-25, 2015

GNU General Public License v2.0


Separate scripts for correcting typos and renaming domains #173

Open andkov opened 7 years ago

andkov commented 7 years ago

Currently these two tasks are accomplished by a single script, `./manipulation/rename-classify.R`.

Such practice is far from optimal for the following reasons:

- once all models are generated automatically, spelling correction will be obsolete
- we may need to re-organize domains for a specific study to increase the bin size within domains
- different tracks may require different domain grouping

For these and other reasons, it is advisable to develop a function that would take in a catalog and the external CSV with grouping instructions, so that this procedure could be applied immediately before table or graph production and NOT during the manipulation phase.
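A minimal sketch of what such a function might look like, in base R. Everything named here is an assumption for illustration: the function name `apply_domain_grouping`, the columns `domain` and `domain_group`, and the example domain labels are hypothetical and do not come from the existing scripts.

```r
# Hypothetical helper: attach a study- or track-specific domain grouping
# to a catalog right before tables or graphs are produced.
# Assumes `catalog` has a `domain` column and `grouping` has the columns
# `domain` and `domain_group` (all illustrative names).
apply_domain_grouping <- function(catalog, grouping) {
  stopifnot(
    "domain" %in% names(catalog),
    all(c("domain", "domain_group") %in% names(grouping)),
    !anyDuplicated(grouping$domain)  # each domain may map to one group only
  )
  idx <- match(catalog$domain, grouping$domain)
  catalog$domain_group <- grouping$domain_group[idx]

  # Report domains that the grouping instructions do not cover.
  unmatched <- unique(catalog$domain[is.na(idx)])
  if (length(unmatched) > 0) {
    warning("No grouping found for: ", paste(unmatched, collapse = ", "))
  }
  catalog
}

# Toy inputs; in practice `grouping` would be read from the external CSV.
catalog <- data.frame(
  model_number = c("b1", "b1", "b1"),
  domain       = c("memory", "recall", "speed"),
  stringsAsFactors = FALSE
)
grouping <- data.frame(
  domain       = c("memory", "recall", "speed"),
  domain_group = c("memory", "memory", "speed"),
  stringsAsFactors = FALSE
)

catalog_grouped <- apply_domain_grouping(catalog, grouping)
```

Because the grouping lives in a separate file, a different study or track could supply its own CSV without touching the manipulation scripts.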

wibeasley commented 7 years ago

@andkov, for the renaming part of the script (currently at line 172), consider pulling that out into a metadata csv with three columns: name_old, name_new, and comments.

It may not be worth messing with now, unless there are multiple name_olds that map to a single name_new. For instance, say one of the scripts produces aa_TAU_00_est, while another (renegade) set of scripts had produced aa_TAU_est_00. Assuming a third set of scripts didn't use both aa_TAU_00_est and aa_TAU_est_00, this should work.
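The many-to-one case can be checked before any renaming happens. The sketch below is only an illustration: the target name `aa_tau_00_est` and the helper `find_collisions` are made up for the example, and the metadata is built inline rather than read from a file.

```r
# Toy metadata: two old spellings map to the same (hypothetical) new name.
ds_renames <- data.frame(
  name_old = c("aa_TAU_00_est", "aa_TAU_est_00", "b_GAMMA_16_se"),
  name_new = c("aa_tau_00_est", "aa_tau_00_est", "b_gamma_16_se"),
  stringsAsFactors = FALSE
)

# New names that more than one old name maps to.
shared_targets <- unique(ds_renames$name_new[duplicated(ds_renames$name_new)])

# Hypothetical helper: a shared target is a problem only when a single
# dataset contains two or more of the old names that map to it.
find_collisions <- function(column_names, ds_renames) {
  present <- ds_renames[ds_renames$name_old %in% column_names, ]
  unique(present$name_new[duplicated(present$name_new)])
}

ok_case  <- find_collisions(c("aa_TAU_00_est", "b_GAMMA_16_se"), ds_renames)
bad_case <- find_collisions(c("aa_TAU_00_est", "aa_TAU_est_00"), ds_renames)
```

Here `ok_case` is empty because only one of the two spellings is present, while `bad_case` flags the target that would be produced twice.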

andkov commented 7 years ago

Good point, thank you, @wibeasley. I would very much like a registry of names of model components. This would especially be useful for different tiers of coordination:

1 - drivers prepare data and run models on their own (like we did for Portland-2015)
2 - drivers use automation scripts for modeling and submit models through GitHub (what we do now)
3 - drivers using a REDCap API to run and submit models (the envisioned future)

The next work-through of the existing scripts will help me identify where the renaming you've mentioned should be the most organic.

wibeasley commented 7 years ago

Cool. Then here's a regex script that will pull out those values and put them into a CSV. Copy & paste the meat of that dplyr::rename() snippet so it looks like:

column_renames <- '
  # general model information
    "study_name"                  = "`study_name`"
  , "model_number"                = "`model_number`"
  , "subgroup"                    = "`subgroup`"
  , "model_type"                  = "`model_type`"
...
  , "b_gamma_16_se"               = "`b_GAMMA_16_se`"
  , "b_gamma_16_wald"             = "`b_GAMMA_16_wald`"
  , "b_gamma_16_pval"             = "`b_GAMMA_16_pval`"
'

Then run this and rename/move the resulting column-renames.csv to some metadata directory.

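library(magrittr)  # provides the %>% pipe

# Capture each "name_new" = "`name_old`" pair and emit it as an "old,new," row.
pattern <- '(?s).+?"(\\w+)"\\s+=\\s*"`(\\w+)`".*?'
rearranged <- gsub(pattern, "\\2,\\1,\n", column_renames, perl = TRUE)
rearranged

# The text contains newlines, so read_csv() treats it as literal data.
# The trailing comma in each row leaves the comments column empty.
ds <- rearranged %>%
  readr::read_csv(col_names = c("name_old", "name_new", "comments"))

readr::write_csv(ds, "./column-renames.csv")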

This is a handy little script for converting code into proper metadata. I'm surprised we haven't needed to write something like this yet.
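For studying the regex, here is a variation on the same idea that uses only base R and a two-row toy input. It is a sketch rather than a replacement: it extracts each pair with `regmatches()` and `gregexpr()` instead of rewriting the whole string with `gsub()`, so no leftover text has to be cleaned up afterwards. The shortened `column_renames` string and the `ds_renames` name are assumptions for the example.

```r
# Toy version of the pasted rename snippet (two entries only).
column_renames <- '
    "study_name"    = "`study_name`"
  , "b_gamma_16_se" = "`b_GAMMA_16_se`"
'

# Group 1 is the new name, group 2 is the old (backticked) name.
pattern <- '"(\\w+)"\\s+=\\s*"`(\\w+)`"'

# Pull out every full match, one string per rename pair.
pairs <- regmatches(
  column_renames,
  gregexpr(pattern, column_renames, perl = TRUE)
)[[1]]

# Reduce each match to one capture group at a time.
ds_renames <- data.frame(
  name_old = sub(pattern, "\\2", pairs, perl = TRUE),
  name_new = sub(pattern, "\\1", pairs, perl = TRUE),
  comments = NA_character_,
  stringsAsFactors = FALSE
)
```

The resulting data frame has the same three columns as the CSV described above and could be written out with `write.csv(ds_renames, row.names = FALSE)` or `readr::write_csv()`.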

wibeasley commented 7 years ago

This is the code that should work (I haven't tested it) when you read the metadata and apply the column name changes.

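# Read the old-to-new name mapping from the metadata file.
ds <- readr::read_csv("./column-renames.csv")
# Named character vector: names are the new column names, values the old ones.
renaming_vector        <- ds$name_old
names(renaming_vector) <- ds$name_new

# `ds_names_old` is the dataset that still carries the old column names.
ds_names_new <- ds_names_old %>% 
  dplyr::rename_(.dots = renaming_vector)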

Edit: and don't be afraid to add extra columns to this metadata file, if it helps anything.
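For anyone reading this later: `dplyr::rename_()` has since been deprecated. A sketch of the same step with the current interface, assuming dplyr 1.0 or later, is below. The toy data frames stand in for the real metadata file and dataset, and their values are made up.

```r
# Toy stand-in for the metadata that would be read from column-renames.csv.
ds_renames <- data.frame(
  name_old = c("b_GAMMA_16_se", "b_GAMMA_16_wald"),
  name_new = c("b_gamma_16_se", "b_gamma_16_wald"),
  stringsAsFactors = FALSE
)

# Named vector in the form rename() expects: new_name = "old_name".
renaming_vector <- stats::setNames(ds_renames$name_old, ds_renames$name_new)

# Toy dataset that still carries the old column names (illustrative values).
ds_names_old <- data.frame(b_GAMMA_16_se = 0.1, b_GAMMA_16_wald = 2.5)

# all_of() errors if any old name is missing from the dataset.
ds_names_new <- dplyr::rename(ds_names_old, dplyr::all_of(renaming_vector))
```

Swapping `dplyr::all_of()` for `dplyr::any_of()` would silently skip old names that a given dataset does not contain, which may suit a shared metadata file that covers several sets of scripts.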

andkov commented 7 years ago

Great regex example for studying. I've finally gotten over the initial scare of using regexes and can learn more elaborate applications. Can't imagine efficient data manipulation without them anymore. Thanks for pushing me down that hill!