aiorazabala / qmethod

R package to analyse Q methodology data
GNU General Public License v2.0

use population, not sample cor #116

Closed maxheld83 closed 9 years ago

maxheld83 commented 9 years ago

it would appear that the same baby-bug that plagues #65 also appears in calculating the correlation matrix.

qmethod() simply calls:

cor.data <- cor(dataset, method=cor.method)

The documentation for cor() says:

The denominator n - 1 is used which gives an unbiased estimator of the (co)variance for i.i.d. observations. These functions return NA when there is only one observation (whereas S-PLUS has been returning NaN), and fail if x has length zero.

By contrast, Brown (1980: 272) seems to be using the population denominator (N rather than n - 1).

This is only a hunch right now, and it might not matter (much). Will report findings back here.

maxheld83 commented 9 years ago

uh, that is confusing: Brown (1980: 267) uses r to denote the correlation coefficient, which is usually used to denote the sample variant.

maxheld83 commented 9 years ago

anyway, I'm not clear about this yet, so I'm tagging it as question, definitely not bug. It's also likely that again (as in #65) the effects will be small either way, and affect all Q-sorts and factors in the same way.

aiorazabala commented 9 years ago

I see. The way forward seems unclear. If you wish, you could make tests with population sd and sample sd and see how it compares to PQMethod results (there are sample datasets in Peter Schmolck's website). For the time being I'd stay with whatever matches with PQMethod results, and keep the theoretical conversation open.
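A minimal sketch of the two standard deviation variants such a test would compare. The Q-sort values and the helper name `sd_population` are hypothetical, chosen only for illustration; nothing here comes from PQMethod or its sample datasets.

```r
# Illustrative forced-distribution Q-sort (hypothetical values).
qsort <- c(-2, -1, -1, 0, 0, 0, 1, 1, 2)
n <- length(qsort)

# Hypothetical helper: population standard deviation (denominator n).
sd_population <- function(x) sqrt(mean((x - mean(x))^2))

sd_sample <- sd(qsort)          # base R sd(), denominator n - 1
sd_pop <- sd_population(qsort)  # denominator n

# The two variants differ only by the constant factor sqrt((n - 1) / n).
ratio <- sd_pop / sd_sample
```

Because the ratio depends only on the number of items, any difference from PQMethod output should show up as the same constant factor across all Q-sorts.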

maxheld83 commented 9 years ago

absolutely, agreed @aiorazabala ! I'm not planning on changing this until/unless I would be able to systematically test this (I think I linked to the testing issue #89 in #65 too). I just wanted to add this here, for now, to make sure that we somehow keep track of these little things.

maxheld83 commented 9 years ago

It now appears – at least on cursory inspection – that there is no difference between Brown (1980: 205) and cor(). Brown (1980: 205) reports a Pearson's r of 0.5375 (rounded to the printed 0.54). cor(), using the same data (code pasted below), yields 0.5375.

It's still confusing to me that Brown seems to talk about the population variant of Pearson's r. (Let's see what the email says, see below)

brown205 <- matrix(
    c(
        4,4, #1
        5,5, #2
        3,4, #3
        5,2, #4
        3,7, #5
        6,8, #6
        5,6, #7
        4,6, #8
        5,1, #9
        4,5, #10
        6,7, #11
        6,4, #12
        7,9, #13
        8,4, #14
        4,6, #15
        1,2, #16
        2,5, #17
        2,3, #18
        4,3, #19
        1,5, #20
        8,8, #21
        7,3, #22
        8,6, #23
        6,5, #24
        6,7, #25
        5,6, #26
        9,7, #27
        7,8, #28
        7,4, #29
        2,1, #30
        3,3, #31
        3,2, #32
        9,9 #33
        ), ncol=2, byrow = TRUE)
cor(brown205[,1], brown205[,2])

maxheld83 commented 9 years ago

I just sent this email:

Dear Dr Brown, Dear Dr Schmolck,

following up on my earlier e-mail about the correct version of the Standard Deviation (which turned out to be a trivial matter), I have come across a similar problem with the appropriate version of Pearson's r.

What would be appropriate – the population or sample variant – and how is this implemented in QMethod?

If I understand correctly, Brown is using the population variant throughout Political Subjectivity, though the coefficient is denoted as r, which, I believe, is conventionally used to identify the sample correlation coefficient (as opposed to rho for the population coefficient). That's a bit confusing to me.

Conceptually, I would think that N (not N-1) would be applicable, once more, because at least in forced distributions the population mean is actually known (0), and there would be no need to de-bias a mean estimate (correct?). On the other hand (and when the distribution is allowed to be asymmetrical), the (population) mean is an estimate, and the items are - at least nominally - a sample from a broader concourse.

I understand that this is very likely a trivial matter that won't affect results (much), and I'm sorry to take up your time with this – still I'd like to make sure that qmethod returns exactly correct results, if only to be able to verify and reproduce results precisely.

Many thanks, Max

maxheld83 commented 9 years ago

Response Dr Schmolck:

I'm not a statistician, but I don't know that the 'normal' r is a biased estimator of Rho. The formula for r contains covariance and variances, that exist as formulas with n and n-1. But which version you use doesn't make a difference because the difference cancels out. See http://de.wikipedia.org/wiki/Korrelationskoeffizient, though the English version is different.... (http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient). In any case, if there exists also an n-1 formula for an unbiased estimate of Rho, it would not be the appropriate one for factor analysis, I believe.

Peter

Duh, indeed, I should have thought about that – it cancels out.
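Written out, the cancellation looks roughly like this. Here d stands for whichever denominator is chosen (n or n - 1); the notation is picked for illustration and is not taken from Brown (1980).

$$
r
= \frac{\frac{1}{d}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
       {\sqrt{\frac{1}{d}\sum_{i=1}^{n}(x_i-\bar{x})^2}\;\sqrt{\frac{1}{d}\sum_{i=1}^{n}(y_i-\bar{y})^2}}
= \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
       {\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}}
$$

The factor 1/d in the covariance is matched by the product of the two square roots of 1/d from the standard deviations, so d drops out of r entirely.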

maxheld83 commented 9 years ago

Response Dr Brown:

Max,

I don’t have a good answer for this. As you point out, no parameters are being estimated that would result in a loss of degrees of freedom. The Q sort is a model of the population, which has unknown parameters since the population of subjective communicability is unlimited, and the Q sort is purposely constructed (when a structured sample is used) rather than randomly selected. Incidentally, Nahinsky worked out the degrees of freedom in Q in the ANOVA case when there was no degree of freedom lost for estimating the mean:

Nahinsky, I.D. (1965). The analysis of variance of Q sort data. Journal of Experimental Education, 34(1), 66-72.
Nahinsky, I.D. (1966). Analysis of variance of intraindividual Q sort patterns. Journal of Clinical Psychology, 22, 34-39.
Nahinsky, I.D. (1967). A Q sort analysis of variance involving the dimensions of sorts, groups, and items. Journal of Experimental Education, 35(3), 36-41.

I suppose the letter r is preferred inasmuch as the Q sample is not the entire population, but r is rarely if ever subject to tests of statistical inference, and the same for the factor loadings. In fact, in a science of subjectivity we cannot guarantee that any statement means the same thing to two different people, and so the correlation between them must be taken with a grain of salt. As with the standard deviation, the difference between N and n – 1 is apt to be quite trivial in the larger scheme of things.

As usual, good luck.

maxheld83 commented 9 years ago

ok, so I think we can safely ignore this, because the results will be the same no matter which you use. That also explains why the results are the same as in Brown 1980, using cor().
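For the record, a quick numerical check of that conclusion. This is only a sketch: the data are made up for illustration (not Brown's), and `pearson_r` is a hypothetical helper, not part of qmethod.

```r
# Illustrative data only (not from Brown 1980 or any Q study).
x <- c(4, 5, 3, 5, 3, 6, 5, 4)
y <- c(4, 5, 4, 2, 7, 8, 6, 6)
n <- length(x)

# Hypothetical helper: Pearson's r with an explicit denominator `d`
# used for the covariance and for both variances.
pearson_r <- function(x, y, d) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  cov_xy <- sum(dx * dy) / d
  var_x <- sum(dx^2) / d
  var_y <- sum(dy^2) / d
  cov_xy / sqrt(var_x * var_y)
}

r_sample <- pearson_r(x, y, d = n - 1)  # sample variant
r_population <- pearson_r(x, y, d = n)  # population variant

# Both variants agree with each other and with base R's cor(),
# up to floating-point tolerance.
same_either_way <- isTRUE(all.equal(r_sample, r_population))
same_as_cor <- isTRUE(all.equal(r_sample, cor(x, y)))
```

Both flags should come out TRUE, which is consistent with cor() reproducing Brown's figure regardless of which variant he had in mind.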

I really need to get a little smarter about this elementary stuff.