CrumpLab / EntropyTyping

A repository for collaborating on our new manuscript investigating how keystroke dynamics conform to information theoretic measures of entropy in the letters people type.
https://crumplab.github.io/EntropyTyping

Data pre-processing #10

Open nbrosowsky opened 6 years ago

nbrosowsky commented 6 years ago

I started to dig into the data a little bit and noticed there are probably some things we want to clean up: spaces that were assigned a letter position, punctuation, numbers, and whole words containing capital letters.

I added this to my dplyr pipeline to clean that up:

library(dplyr)

subject_means <- the_data %>%
             filter(
                      Letters != " ",                   # removes spaces (just in case they were assigned a letter position)
                      !grepl("[[:punct:]]", Letters),   # removes punctuation
                      !grepl("[0-9]", Letters),         # removes numbers
                      !grepl("[A-Z]", whole_word)       # removes whole words that contain a capital letter
             )
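
Since the object is named subject_means, presumably this filter feeds into a grouped summary afterwards; a minimal sketch, with hypothetical grouping columns Subject and let_pos (and an IKSIs column) standing in for whatever the real columns are called:

subject_means <- subject_means %>%      # the filtered data from above
             group_by(Subject, let_pos) %>%        # hypothetical column names
             summarise(mean_IKSI = mean(IKSIs))    # the IKSI column name is also a guess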
CrumpLab commented 6 years ago

This is good!

Does the capital letter filter remove the whole word in the sense of removing all of the other letters in the word?

We could keep the other letters, which would give us more observations.

More generally, having a pre-processing thread like this is a very good idea.

nbrosowsky commented 6 years ago

Yeah, currently it removes the whole word.

If you change "whole_word" to "Letters", it'll just remove the individual letters.
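
For example, the last line of the filter would become:

!grepl("[A-Z]", Letters)    # removes only the capital-letter keystrokes, keeping the rest of the word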

CrumpLab commented 6 years ago

Outlier removal

Currently I just eliminated all IKSIs above 2000 ms, which is a somewhat arbitrary cut-off. We have adopted a standard practice of using the Van Selst & Jolicoeur procedure, so we still need to add that.
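
The cutoff itself is just a one-line filter. A minimal sketch, assuming the inter-keystroke intervals live in a column named IKSIs (the column name is a guess):

the_data <- the_data %>%
             filter(IKSIs <= 2000)    # keep only inter-keystroke intervals at or below 2000 ms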

Always worth noting that outlier elimination is a never-ending issue, in the sense that we have enormous degrees of freedom to try any number of different elimination schemes. The worst and most useless thing we could do would be to automate the process of trying a million different techniques, and then pick the one that "makes the data better".

We do need to justify whatever practice we adopt. One thing to do is be consistent (for example, we should remind ourselves what we did for Behmer & Crump, 2016) and do the same thing here. If our findings depend on our choice of outlier elimination procedure, then we know that something is wrong with our experiment, and we are probably just measuring noise. So, another gut check here is to try a couple of reasonable elimination procedures that get rid of the massive numbers (e.g., nobody takes 1000000 seconds to type a letter; those should be removed because one of the participants must have left to make a sandwich or something).

Some elimination procedures are:

  1. arbitrary cutoff (500 ms, 1000 ms, 2000 ms)
  2. standard deviation cutoff (2 SD or 3 SD are common) <- these are biased when cell sizes are small
  3. transforms (log, inverse, etc.) <- not as common; whenever I see this I'm always skeptical of the degrees-of-freedom issue, especially when there is no good reason for the transform
  4. median rather than the mean
  5. Van Selst & Jolicoeur: a nice mix of SD cutoffs as a function of cell size, along with a recursive method and Monte Carlo simulations to show how it behaves <- what we normally use (see the sketch after this list)
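
Until we implement the real thing, here is a minimal sketch of a simple recursive SD trim. This is NOT the full Van Selst & Jolicoeur procedure: their criterion moves with cell size (stricter for small cells, approaching 2.5 SD for large ones), whereas the fixed k = 2.5 here is just a placeholder for illustration.

recursive_trim <- function(x, k = 2.5) {
  repeat {
    if (length(x) < 3) return(x)             # too few observations to estimate a cutoff
    keep <- abs(x - mean(x)) <= k * sd(x)    # flag values within k SDs of the current mean
    if (all(keep)) return(x)                 # converged: nothing left to remove
    x <- x[keep]                             # drop the outliers and recompute
  }
}

# hypothetical usage on one subject-by-condition cell of IKSIs:
# trimmed <- recursive_trim(cell_iksis)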
CrumpLab commented 6 years ago

Cool, changing to "Letters" just does that. I like it when that stuff is easy.

wlai0611 commented 6 years ago

I was looking at the data and found entries whose whole_word values are words like "Felis." or "vertebrae.", where the word lengths are 1 more than the actual word length because of the trailing punctuation. So this code corrects their word lengths.

This gets all the rows with punctuation at the end of the word and subtracts 1 from their length:

# flag rows whose whole_word ends in a punctuation character
ends_in_punct <- grepl("[[:punct:]]$", the_data$whole_word)

# subtract the trailing punctuation mark from the recorded word length
the_data$word_lengths[ends_in_punct] <- the_data$word_lengths[ends_in_punct] - 1

It should print 9 now, not 10:

the_data[the_data$whole_word=="vertebrae.",]$word_lengths

CrumpLab commented 6 years ago

Great, I added that to my pre-processing.