CrumpLab opened 6 years ago
I'm jumping ahead with some lme4 stuff. I also got a few lmer models working, and confirmed what I thought they were doing based on this graph:
This is for the first 20 subjects. I created a new factor to separate first letter positions from all other letter positions. Then I fit the model and plotted the model predictions along with the mean_IKSIs for each subject.
What is going on here is that the model gives subject-level intercepts and slopes for mean IKSI as a function of H and the categorical position factor (first vs. other). In some ways this is similar to running individual linear regressions for each subject.
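In code, the model being described looks something like this (an untested sketch, not the actual analysis script; `iksi_data` and its columns `subject`, `mean_IKSI`, `H`, and `let_pos` are placeholder names):

```r
# Sketch: subject-level intercepts and slopes for H and the
# first-vs-other position factor, plus per-subject prediction plots.
library(lme4)
library(dplyr)
library(ggplot2)

# new factor separating first letter positions from all others
iksi_data <- iksi_data %>%
  mutate(position = factor(if_else(let_pos == 1, "first", "other")))

# random intercepts and slopes for H and position, by subject
fit <- lmer(mean_IKSI ~ H * position + (H * position | subject),
            data = iksi_data)

# plot model predictions along with the mean IKSIs for each subject
iksi_data$predicted <- predict(fit)
ggplot(iksi_data, aes(x = H, y = mean_IKSI, color = position)) +
  geom_point() +
  geom_line(aes(y = predicted)) +
  facet_wrap(~subject)
```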
I recalculated H according to the letter n-1 probability. Here's what I did (and I'll need someone to check my work):

1. H values for letter position 1 were calculated the same way as before.
2. For every other letter position, I calculated a separate H value conditional on each of the 26 possible letters appearing in the n-1 position.
3. I averaged those 26 conditional H values to get a single H for each letter position (a code sketch of steps 2 and 3 follows).
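In code, steps 2 and 3 look something like this (an untested sketch; `bigrams` is a placeholder for a counts table of letter pairs by position, with columns `let_pos`, `prev_letter`, `letter`, and `count`):

```r
library(dplyr)

# Shannon entropy in bits of a probability vector
shannon_H <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

# step 2: one conditional H per n-1 letter at each position (2 and up)
H_by_prev <- bigrams %>%
  group_by(let_pos, prev_letter) %>%
  summarise(H = shannon_H(count / sum(count)), .groups = "drop")

# step 3: average the 26 conditional H values per position
H_by_position <- H_by_prev %>%
  group_by(let_pos) %>%
  summarise(H = mean(H))
```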
Here's what that looks like:
Which looks a lot like...
Pearson's product-moment correlation
data: group_means$mean_IKSIs and group_means$H
t = 13.206, df = 43, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8168979 0.9416343
sample estimates:
cor
0.8956632
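(That is the standard print from R's `cor.test`; presumably the call was something like the following, with the data frame name taken from the `data:` line above.)

```r
cor.test(group_means$mean_IKSIs, group_means$H)
```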
Wow, very cool!
Will look at your code hopefully tomorrow morning. This is a very good step to take. We talked before about the expanding number of ways we could calculate H, e.g., n-1, n-2, or any other n-x or n+x combination, which is basically a scary infinity problem. But we might as well do some more complicated things, as you have done. The scatterplot looks really pretty.
Nick's result suggests that first-letter and mid-word slowing can be explained by a single process sensitive to letter uncertainty (when taking n-1 into account). When we ignore n-1 in calculating H, we find that H values in the first position are generally high, dip in position 2, and show an inverted-U pattern across the remaining letters. This is broadly consistent with first-letter and mid-word slowing, but it doesn't explain much of the variance in mean IKSI. For example, in the mixed model graph showing 20 subjects, we can see some evidence that first-letter IKSIs are generally longer than other-letter IKSIs. One might conclude that 1) H does correlate with mean IKSI for both first and other letters, but 2) because first-letter IKSIs are in general longer than other-letter IKSIs, there is good evidence for an additional planning-type process that adds time to beginning a sequence.
However, when we include n-1 in the calculation of letter uncertainty, we find that H for letters in positions 2-9 drops quite a bit compared to calculations of H that do not include n-1. This matches the mean IKSIs much better, and the resulting scatterplot shown by Nick does not suggest a separation between first-letter IKSIs and other-letter IKSIs.
I attempted to make the H (based on previous let_pos) vs. IKSI graph.
It is pretty similar:
Pearson's product-moment correlation
data: sum_data$H and sum_data$mean_IKSI
t = 12.419, df = 43, p-value = 8.154e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7978596 0.9351025
sample estimates:
cor
0.8842931
I didn't know how to make the multi-panel plots from the first figure, but I attached the average H values corresponding to the letter positions so you can check them against yours:
Great, that does look pretty similar.
I was just thinking about this for the H (n-1) calculation: when we average the H values for each letter position, should we weight the H values from more frequent letters more heavily? For example, should E's H value have more weight than Z's?
Nick, any thoughts on this one?
So, after step 2 from Nick's n-1 analysis, I used the ngrams1 file to get the probability of each letter. Before averaging the 26 H values for each letter position, I multiplied each letter's H by that letter's probability. Then I did step 3 (averaging them) and got the new 36 H values.
Pearson's product-moment correlation
data: sum_data$H and sum_data$mean_IKSI
t = 13.5, df = 43, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8233514 0.9438274
sample estimates:
cor
0.8994942
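In code, that weighting step looks something like this (an untested sketch, not Walter's actual code, reusing the placeholder `H_by_prev` from the earlier sketch; `letter_probs` stands in for the ngrams1 probabilities, with columns `letter` and `p`):

```r
library(dplyr)

# weight each n-1 letter's conditional H by that letter's overall probability
weighted_H <- H_by_prev %>%
  left_join(letter_probs, by = c("prev_letter" = "letter")) %>%
  group_by(let_pos) %>%
  summarise(H = sum(p * H))
```

Since the 26 probabilities sum to 1, summing the weighted values gives an expected H; dividing by 26 afterwards, as described above, only rescales every value by a constant, which leaves the correlation unchanged.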
Good question. I did think about this when I was programming the n-1 version. In principle, I think it makes sense to weight them, but I also think it just complicates things further.
I just saw, Walter, that you posted a new analysis while I was typing this.
The reason I think it's complicated is that you could calculate the general letter frequency (I think that's what you did) and weight by that, but you could also calculate the frequency of each letter in the n-1 position and use those weights. So when you calculate H for letter position 2, you'd have to know the relative frequencies of each letter in position 1, and then weight the position 2 H values by those. You'd have to do that for each letter position. It shouldn't be too difficult, but it does complicate things further. I'll take a look at my code and see if I can add that.
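Something like this, continuing the placeholder objects from the earlier sketches (untested; the real code may structure it differently):

```r
library(dplyr)

# relative frequency of each n-1 letter within each letter position,
# from the same hypothetical `bigrams` counts table
prev_letter_probs <- bigrams %>%
  group_by(let_pos, prev_letter) %>%
  summarise(n = sum(count), .groups = "drop_last") %>%
  mutate(p = n / sum(n)) %>%
  ungroup()

# weight each conditional H by its n-1 letter's position-specific frequency
position_weighted_H <- H_by_prev %>%
  left_join(prev_letter_probs, by = c("let_pos", "prev_letter")) %>%
  group_by(let_pos) %>%
  summarise(H = sum(p * H))
```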
Seeing the way the paper is unfolding right now, I think just using the average is an OK approximation. If we were to go further and do n-2 (bigrams), n-3 (trigrams), etc., I think we would definitely have to do something to deal with missing and rare values. Weighting them could deal with that.
I was weighting them by general letter frequency, not the frequencies specific to each letter position.
I see what you are saying about how weighting introduces unnecessary complications. Looking back at the results, I don't think my weighting changed the correlation much.
Thanks everyone for weighing in. It's interesting that the results don't change much when we weight the Hs. I haven't thought about this in depth yet, so I'm still not sure what best practice is here, or what we should expect. This might be one of those things where we present what we did and see if the reviewers have better suggestions for other things to do.
Yeah, it's definitely not straightforward.
Especially considering that a letter that occurs often at n-1 (e.g., "E") likely has higher uncertainty for position n than a letter that doesn't occur often at n-1 (e.g., "Q"), probably because a letter occurs more often if it can be paired with lots of different letters. So there's a relationship between the frequency of n-1 letters and uncertainty at position n that complicates things.
Opening this thread for discussion on how best to determine whether letter uncertainty as a function of position and word length explains variance in mean IKSI as a function of position and word length.