Filter tables - simple case with (model_type)

andkov commented 8 years ago

@wibeasley, please take a look at the example I've developed in ./manipulation/rename-classify.R. Instead of cognitive outcome, I've chosen a simpler case, model type. We don't have to deal with classification yet.

I've created a .csv file ./manipulation/model_type-entry-table.csv containing the instructions for renaming.

category_short	entry	notes
0	0	literal
0	empty	alternative to 0
a	a	literal
a	age	alternative to age
a	aeg	misspelled age
a	Age	misspelled age
a	AGe	misspelled age
ae	ae	literal
aeh	aeh	literal
aeh	ahe	misspelled aeh
aehplus	aehplus	literal
aehplus	aheplus	misspeled aehplus
full	full	literal

./manipulation/rename-classify.R first looks at the values across studies at line 43

> t <- table(ds$model_type, ds$study_name);t[t==0]<-".";t

          eas elsa hrs ilse lasa nuage octo radc satsa
  0       .   .    .   .    .    .     .    .    20   
  a       .   .    .   .    .    10    .    113  10   
  ae      58  .    .   .    .    .     .    109  34   
  aeh     57  .    24  14   .    16    72   116  34   
  aehplus 57  18   28  25   18   16    58   113  40   
  age     56  .    24  14   .    6     72   .    24   
  aheplus 1   .    .   .    .    .     .    .    .    
  empty   2   .    .   4    .    10    .    .    .    
  full    58  .    .   .    .    .     4    .    .

and then conducts re-assignment in lines 47-52

model_type_key <- read.csv("./manipulation/model_type-entry-table.csv", stringsAsFactors = F)
for(i in length(model_type_key$entry)){
  entry <- model_type_key[i , "entry"]
  category_short <- model_type_key[i , "category_short"]
  ds$model_type_new <- gsub(pattern = entry, replacement = category_short, x = ds$model_type )
}

but doesn't really do it, because

> t <- table(ds$model_type_new, ds$study_name);t[t==0]<-".";t

          eas elsa hrs ilse lasa nuage octo radc satsa
  0       .   .    .   .    .    .     .    .    20   
  a       .   .    .   .    .    10    .    113  10   
  ae      58  .    .   .    .    .     .    109  34   
  aeh     57  .    24  14   .    16    72   116  34   
  aehplus 57  18   28  25   18   16    58   113  40   
  age     56  .    24  14   .    6     72   .    24   
  aheplus 1   .    .   .    .    .     .    .    .    
  empty   2   .    .   4    .    10    .    .    .    
  full    58  .    .   .    .    .     4    .    .

@wibeasley, is this setup what you originally proposed? I'd like to get this simple case working first, before moving on to more populous cases, such as cognitive_measure. If yes, what am I missing here to make it work?

wibeasley commented 8 years ago

Everything looks like it's set up well. I'm going to change just a few things.
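A note on why the loop above changes nothing, sketched on toy data (the data frames below are made-up stand-ins, not the real files). `for(i in length(...))` iterates over a single number, the row count, so the body runs once, for the last row of the key. In the .csv above that row maps full to full, so "full" is replaced with "full" and nothing else is touched. `seq_along()` visits every row. The `match()` line is only one possible loop-free, exact-match lookup in base R, shown for comparison.

```r
# Toy stand-ins for the thread's objects (illustrative values only).
model_type_key <- data.frame(
  entry          = c("age", "empty", "full"),
  category_short = c("a",   "0",     "full"),
  stringsAsFactors = FALSE
)
ds <- data.frame(
  model_type = c("age", "empty", "full", "ae"),
  stringsAsFactors = FALSE
)

# length() returns one number, so this loop body runs exactly once, with i = 3.
visited <- c()
for (i in length(model_type_key$entry)) visited <- c(visited, i)

# seq_along() yields 1, 2, 3, so every row of the key is visited.
visited_all <- c()
for (i in seq_along(model_type_key$entry)) visited_all <- c(visited_all, i)

# Loop-free, exact-match lookup; values without a rule ("ae" here) become NA.
ds$model_type_new <- model_type_key$category_short[
  match(ds$model_type, model_type_key$entry)
]
```

Even with `seq_along()`, the original loop body reads from ds$model_type on every pass and overwrites ds$model_type_new, so only the last rule would survive. `gsub()` also matches substrings rather than whole values, which can touch longer entries unintentionally.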

The real work is completed by this single left join. The rest of the commit's code is just clean up.

# Join the model data frame to the conversion data frame.
ds <- ds %>% 
  dplyr::left_join(ds_model_type_key, by=c("model_type"="entry"))

I believe the first table is what you were going for above. The second table is essentially a transition matrix (from the old names to the cleaned/condensed categories).

> t <- table(ds$category_short, ds$study_name);t[t==0]<-".";t
          eas elsa hrs ilse lasa nuage octo radc satsa
  0       2   .    .   4    .    10    .    .    20   
  a       56  .    24  14   .    16    72   113  34   
  ae      58  .    .   .    .    .     .    109  34   
  aeh     57  .    24  14   .    16    72   116  34   
  aehplus 58  18   28  25   18   16    58   113  40   
  full    58  .    .   .    .    .     4    .    .    

> t <- table(ds$model_type, ds$category_short);t[t==0]<-".";t
          0  a   ae  aeh aehplus full
  0       20 .   .   .   .       .   
  a       .  133 .   .   .       .   
  ae      .  .   201 .   .       .   
  aeh     .  .   .   333 .       .   
  aehplus .  .   .   .   373     .   
  age     .  196 .   .   .       .   
  aheplus .  .   .   .   1       .   
  empty   16 .   .   .   .       .   
  full    .  .   .   .   .       62
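For anyone who wants to try the join without the project files, here is a minimal self-contained sketch. The data values are invented for illustration; only the object and column names (ds, ds_model_type_key, model_type, entry, category_short, study_name) follow the thread.

```r
library(dplyr)

# Toy data; these rows are illustrative, not the real entries.
ds <- data.frame(
  study_name = c("eas", "eas", "radc", "satsa", "satsa"),
  model_type = c("age", "aheplus", "a", "0", "empty"),
  stringsAsFactors = FALSE
)
ds_model_type_key <- data.frame(
  entry          = c("0", "empty", "a", "age", "aehplus", "aheplus"),
  category_short = c("0", "0",     "a", "a",   "aehplus", "aehplus"),
  stringsAsFactors = FALSE
)

# Each row of ds picks up the category_short whose entry equals its model_type.
ds <- ds %>%
  dplyr::left_join(ds_model_type_key, by = c("model_type" = "entry"))

# Cleaned categories by study.
by_study <- table(ds$category_short, ds$study_name)

# Transition matrix: raw entries (rows) against cleaned categories (columns).
transition <- table(ds$model_type, ds$category_short)
```

Because it is a left join, every row of ds is kept, and a model_type with no matching entry in the key gets NA in category_short instead of being dropped.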

andkov commented 8 years ago

Ah, good! Thanks, @wibeasley. This setup is certainly more welcoming to the non-coding crew.

I wouldn't have reached for the joins to do the work, so I'm glad you've shown this. I hoped there was a one-line solution.

I like the new table, it's quite informative. It makes debugging easier.

I'll work through the rest of the items. I may need help when I get to incorporating sorting (into domains) into joins. Thanks again!

wibeasley commented 8 years ago

> I'll work through the rest of the items. I may need help when I get to incorporating sorting (into domains) into joins. Thanks again!

No problem. Just tell me when.

> I like the new table, it's quite informative. It makes debugging easier.

Yeah, I considered stringing together those dplyr statements, but it would have prevented us from peeking at the transition matrix.

And I like the format of a transition matrix. I usually use something like dplyr::count() for real tallying. But the table() display lets you see the "off diagonals" better. I like your touch replacing the zeros with a dot.
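To make the contrast concrete, a small sketch on invented data (the rows below are illustrative only): `dplyr::count()` returns one row per combination that actually occurs, while `table()` lays out every combination in a grid, which is what makes the off-diagonal cells visible.

```r
library(dplyr)

# Toy joined data; values are illustrative, not the real entries.
ds <- data.frame(
  model_type     = c("age", "age", "a", "aheplus", "empty"),
  category_short = c("a",   "a",   "a", "aehplus", "0"),
  stringsAsFactors = FALSE
)

# Long format: one row per observed (model_type, category_short) pair, with n.
tally <- dplyr::count(ds, model_type, category_short)

# Wide format: every combination gets a cell, including the empty ones.
tab <- table(ds$model_type, ds$category_short)

# Replacing zeros with "." coerces the counts to character: display only.
tab[tab == 0] <- "."
```

After the dot replacement the cells are strings, so any further arithmetic should be done on the table before that step, or on the `count()` result.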

ampiccinin commented 8 years ago

Just a thought – can we assume

1) aeg is not aeh?

2) ahe is not age?

H and G are typed with different fingers, but are right next to each other on the keyboard. I guess there is a context that determines it – i.e., aeh shows up in the filename and age is a covariate?

andkov commented 8 years ago

Good point, @ampiccinin. We are not protected from errors like that. We won't be able to catch the mistake here if the name of the file came with that misspelling. However, when we look for fixed effects, this mistake will become apparent and we'll have to come back, locate the file, and add a line to the .csv to rename the outcomes to correct for this misspelling. Yes, such misspellings are very costly to debug.

The idea of this .csv file is that it offers an easy way to edit these corrections. The column category_short contains what each entry will be renamed to, while the column notes explains the substitution. It's hard to anticipate ALL possible misspellings and, as you showed in your example, the interpretation may be highly contextual. So I think our strategy would be: watch out for things that don't make sense and edit the .csv by entering additional renaming rules.
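One way the "watch out and add rules" step could be partly automated is to list the observed entries that have no row in the key yet. This is a rough sketch on made-up values, in base R. The vector `observed` is a hypothetical stand-in for ds$model_type, and "aeg" is deliberately left out of the toy key to play the role of a new misspelling.

```r
# Toy key; illustrative rows only.
model_type_key <- data.frame(
  entry          = c("a", "age", "aeh", "ahe"),
  category_short = c("a", "a",   "aeh", "aeh"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for ds$model_type.
observed <- c("a", "age", "aeh", "aeg", "ahe", "aeg")

# Entries seen in the data that have no renaming rule yet.
unmatched <- setdiff(observed, model_type_key$entry)

# How often each unmatched entry occurs, to help decide which rules to add first.
unmatched_counts <- table(observed[!observed %in% model_type_key$entry])
```

This only flags entries that are absent from the key. It cannot tell whether an existing rule (say, aeg read as age rather than aeh) is the right interpretation; that still needs the contextual check described above.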

We'll have a modification of this csv for classifying into domains as well. This is how we can make it customizable to every track, while keeping the bulk of the code stable across tracks (and projects).

IALSA / Portland-physical-cognitive
