CrumpLab / EntropyTyping

A repository for collaborating on our new manuscript investigating how keystroke dynamics conform to information theoretic measures of entropy in the letters people type.
https://crumplab.github.io/EntropyTyping

Analysis: Does letter uncertainty explain variation in mean IKSI? #17

Open · CrumpLab opened this issue 6 years ago

CrumpLab commented 6 years ago

Opening this thread for discussion on how best to determine whether letter uncertainty as a function of position and word length explains variance in mean IKSI as a function of position and word length.

CrumpLab commented 6 years ago

I'm jumping ahead with some lme4 stuff. I also got a few lmer models working, and confirmed what I thought they were doing based on this graph:

[screenshot 2018-06-19: model predictions plotted with the mean IKSIs for the first 20 subjects]

This is for the first 20 subjects. I created a new factor to separate first letter positions from all other letter positions. Then I fit the model and plotted the model predictions along with the mean_IKSIs for each subject.

What is going on here is that the model is giving subject-level intercepts and slopes for mean IKSI as a function of H and the categorical position factor (first vs. other). In some ways this is similar to running individual linear regressions for each subject.
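
Roughly, the model looks something like this (a sketch only; `first20_means`, `first_vs_other`, and `Subject` are placeholder names, not necessarily the ones in the analysis script):

```r
library(lme4)

# Sketch of the model described above, using placeholder column names:
# mean IKSI as a function of H and the first-vs-other position factor,
# with subject-level (random) intercepts and slopes for both predictors.
fit <- lmer(mean_IKSI ~ H + first_vs_other + (1 + H + first_vs_other | Subject),
            data = first20_means)

# Per-subject model predictions, for plotting alongside the observed mean IKSIs
first20_means$predicted <- predict(fit)
```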

nbrosowsky commented 6 years ago

I recalculated H conditional on the n-1 letter. Here's what I did (and I'll need someone to check my work):

  1. Calculate H for every letter position given each n-1 letter. So calculate H for letter position 2 / word length 2 given n-1 was "A", calculate H given n-1 was "B", etc.
  2. This gives 26 H values for every letter position
  3. Then I just averaged over all 26 H values to get a single H value per position

H values for letter position 1 were calculated the same way as before.
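
In code, the calculation is roughly this (a sketch for checking only; `counts` and its column names are placeholders for whatever tally of letters by word length, position, and n-1 letter we end up using):

```r
library(dplyr)

# Steps 1 and 2: H for each position/word length, conditioned on the n-1 letter.
# Assumes a long data frame `counts` with columns:
#   word_length, position, prev_letter (the n-1 letter), letter, count
H_by_prev <- counts %>%
  group_by(word_length, position, prev_letter) %>%
  mutate(p = count / sum(count)) %>%
  summarise(H = -sum(ifelse(p > 0, p * log2(p), 0)), .groups = "drop")

# Step 3: average the 26 conditional H values within each position/word length
H_n1 <- H_by_prev %>%
  group_by(word_length, position) %>%
  summarise(H = mean(H), .groups = "drop")
```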

Here's what that looks like:

[image: H by letter position and word length, conditioned on the n-1 letter]

Which looks a lot like...

[image: mean IKSI by letter position and word length]

[image: scatterplot of mean IKSI vs. H]

Pearson's product-moment correlation 
data:  group_means$mean_IKSIs and group_means$H
t = 13.206, df = 43, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8168979 0.9416343
sample estimates:
 cor 
 0.8956632
CrumpLab commented 6 years ago

Wow, very cool!

Will look at your code, hopefully tomorrow morning. This is a very good step to take. We talked before about the expanding number of ways we could calculate H, e.g., n-1, n-2, or any other n-x or n+x combination, which is basically a scary infinity problem. But we might as well do some more complicated things, as you have done. The scatterplot looks really pretty.

CrumpLab commented 6 years ago

Nick's result suggests that first-letter and mid-word slowing can be explained by a single process sensitive to letter uncertainty (when taking n-1 into account). When we ignore n-1 in calculating H, we find that H values in the first position are generally high, dip in position 2, and show an inverted-U pattern across the remaining letters. This is broadly consistent with first-letter and mid-word slowing, but it doesn't explain much of the variance in mean IKSI. For example, in the mixed-model graph showing 20 subjects, we can see some evidence that first-letter IKSIs are generally longer than other-letter IKSIs. One might conclude that 1) H does correlate with mean IKSI for both first and other letters, but 2) because first-letter IKSIs are in general longer than other-letter IKSIs, there is good evidence for an additional planning-type process that adds time to beginning a sequence.

However, when we include n-1 in the calculation of letter uncertainty, we find that H for letters in positions 2-9 drops quite a bit compared to calculations of H that do not include n-1. This matches the mean IKSIs much better, and the resulting scatterplot shown by Nick does not suggest a separation between first-letter IKSIs and other-letter IKSIs.
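
This drop is also what we would expect from the definitions. If the 26 conditional H values were weighted by how often each n-1 letter actually occurs (the equal-weight average is an approximation to this), the result is the conditional entropy, which can never exceed the unconditional entropy:

$$
H(X) = -\sum_{x} p(x)\log_2 p(x), \qquad
H(X \mid X_{n-1}) = \sum_{y} p(y)\, H(X \mid X_{n-1} = y) \le H(X)
$$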

wlai0611 commented 6 years ago

I attempted to make the H (based on previous let_pos) vs. IKSI graph.
It is pretty similar:

[image: hbasedonpreviouspos]

Pearson's product-moment correlation

data:  sum_data$H and sum_data$mean_IKSI
t = 12.419, df = 43, p-value = 8.154e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7978596 0.9351025
sample estimates:
 cor 
 0.8842931

I didn't know how to make the multi-panel plots from before, but I attached the average H values corresponding to the letter positions so you can check them against yours:

HValuesBasedonPrevious.txt

CrumpLab commented 6 years ago

Great, that does look pretty similar.

wlai0611 commented 6 years ago

I was just thinking of this for the H (n-1) calculation: when we average the H values for each letter position, should we weight the H's from more frequent letters more? For example, should E's H value have more weight than Z's H value?

CrumpLab commented 6 years ago

Nick, any thoughts on this one?

wlai0611 commented 6 years ago

So after step 2 from Nick's n-1 analysis, I used the ngrams1 file to get the probabilities for each letter, and before averaging the 26 H values for each letter position I multiplied each letter's H by the letter's probability. Then I did step 3 (averaged them) and got the new 36 H values.
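
In code, this was roughly the following (a sketch; `H_by_prev`, `letter_probs`, and the column names are placeholders for the objects in my script):

```r
library(dplyr)

# Weight the 26 conditional H values by overall letter probability.
# Assumes (placeholder names):
#   H_by_prev:    word_length, position, prev_letter, H
#   letter_probs: letter, p  (overall letter probabilities from the ngrams1 file)
H_weighted <- H_by_prev %>%
  left_join(letter_probs, by = c("prev_letter" = "letter")) %>%
  group_by(word_length, position) %>%
  # A proper weighted mean divides by the sum of the weights; multiplying by p
  # and then taking the plain mean only rescales H by a constant (when all 26
  # letters appear at every position), so the correlation with mean IKSI is
  # unaffected either way.
  summarise(H = sum(p * H) / sum(p), .groups = "drop")
```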

[image: weightedhnmin1]

Pearson's product-moment correlation

data:  sum_data$H and sum_data$mean_IKSI
t = 13.5, df = 43, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8233514 0.9438274
sample estimates:
 cor 
 0.8994942

nbrosowsky commented 6 years ago

Good question. I did think about this when I was programming the n-1 version. In principle, I think it makes sense to, but I also think it just complicates things further.

Walter, I just saw that you posted a new analysis while I was typing this.

The reason I think it's complicated is that you could calculate the general letter frequency (I think that's what you did) and weight by that. But you could also calculate the frequency of each letter in the n-1 position and use those weights. So when you calculate H for letter position 2, you'd have to know the relative frequencies of each letter in position 1, then weight the position 2 H values by that. You'd have to do that for each letter position. It shouldn't be too difficult to do, but it does complicate things further. I'll take a look at my code and see if I can add that.
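
Something like this, I think (a sketch; it reuses the placeholder `counts` and `H_by_prev` objects from the sketch above):

```r
library(dplyr)

# Position-specific weights: how often each letter occurs as the n-1 letter
# for a given position/word length
prev_weights <- counts %>%
  group_by(word_length, position, prev_letter) %>%
  summarise(n = sum(count), .groups = "drop_last") %>%
  mutate(w = n / sum(n)) %>%
  ungroup()

# Weight each conditional H by the frequency of its n-1 letter at that position
H_weighted_pos <- H_by_prev %>%
  left_join(prev_weights, by = c("word_length", "position", "prev_letter")) %>%
  group_by(word_length, position) %>%
  summarise(H = sum(w * H), .groups = "drop")
```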

Given the way the paper is unfolding right now, I think just using the average is an OK approximation. If we were to go further and do n-2 (bigrams), n-3 (trigrams), etc., I think we would definitely have to do something to deal with missing and rare values. Weighting them could deal with that.

wlai0611 commented 6 years ago

I was weighting them by general letter frequency, not the frequencies specific to each letter position.
I see what you are saying about how weighting introduces unnecessary complications. Looking back at the results, I don't think my weighting changed the correlation much.

CrumpLab commented 6 years ago

Thanks everyone for weighing in. It's interesting that the results don't change much when we weight the Hs. I haven't thought about this in depth yet, so I'm still not sure what the best practice is here, or what we should expect, etc. This might be one of those things where we present what we did and see if the reviewers have better suggestions for other things to do.

nbrosowsky commented 6 years ago

Yeah, it's definitely not straightforward.

Especially considering that a letter that occurs often at n-1 (e.g., "E") likely has higher uncertainty for position n than a letter that doesn't occur often at n-1 (e.g., "Q"), probably because a letter occurs more often if it can be paired with lots of different letters. So there's a relationship between the frequency of n-1 letters and uncertainty at position n that complicates things.