createData and temporalAutocorrelation while still having random effects variance

melina-leite commented 1 month ago

Testing some data simulated by createData, I realized that even when we don't include temporal autocorrelation in the simulation, it can be present in the dataset:

set.seed(123)
testData <- createData(sampleSize = 200,
                       overdispersion = 0, randomEffectVariance = 0.5,
                       family = gaussian())
fittedModel <- lme4::lmer(observedResponse ~ Environment1 + (1|group),
                          data = testData)
simulationOutput <- simulateResiduals(fittedModel, n=200)
testTemporalAutocorrelation(simulationOutput = simulationOutput,
                            time = testData$time)

set.seed(123)
testData <- createData(sampleSize = 200, temporalAutocorrelation = 1,
                       overdispersion = 0, randomEffectVariance = 0.5,
                       family = gaussian())
head(testData)

fittedModel <- lme4::lmer(observedResponse ~ Environment1 + (1|group),
                          data = testData)
summary(fittedModel)
simulationOutput <- simulateResiduals(fittedModel, n=200)
testTemporalAutocorrelation(simulationOutput = simulationOutput,
                            time = testData$time)

I suspected that it could be because the time column is ordered and the RE groups are also ordered. However, trying to shuffle the time column to have the time variable unclustered within group didn't solve the issue. I tried to change the createData function:

#line 36
time = sample(1:sampleSize, sampleSize) # change here

#lines 78 - 90
 t = 1:sampleSize            
 distMat <- as.matrix(dist(t))

 invDistMat <- 1/distMat * 5000
diag(invDistMat) <- 0
invDistMat = sfsmisc::posdefify(invDistMat)

temporalError <- MASS::mvrnorm(n = 1, mu = rep(0,sampleSize), Sigma = invDistMat)

linearResponse = linearResponse + temporalAutocorrelation * temporalError[time] # reordering according to time

And now, even with very strong temporal autocorrelation, the residuals don't show the pattern:

set.seed(123)
testData2 <- createData2(sampleSize = 200,
                       overdispersion = 0, randomEffectVariance = 0.5,
                       family = gaussian())
fittedModel <- lme4::lmer(observedResponse ~ Environment1 + (1|group),
                          data = testData2)
simulationOutput <- simulateResiduals(fittedModel, n=200)
testTemporalAutocorrelation(simulationOutput = simulationOutput,
                            time = testData2$time)

set.seed(123)
testData2 <- createData2(sampleSize = 100, temporalAutocorrelation = 2,
                       overdispersion = 0, randomEffectVariance = 0.5,
                       family = gaussian())
fittedModel <- lme4::lmer(observedResponse ~ Environment1 + (1|group),
                          data = testData2)
simulationOutput <- simulateResiduals(fittedModel, n=200)
testTemporalAutocorrelation(simulationOutput = simulationOutput,
                            time = testData2$time)

I'm puzzled. The code is in Code/DHARMaIssues/436.

florianhartig commented 1 month ago

Running

testData = createData(sampleSize = 100, family = gaussian(), fixedEffects = 1,
                      randomEffectVariance = 0, temporalAutocorrelation = 10)

fittedModel <- lm(observedResponse ~ Environment1, data = testData)
res = simulateResiduals(fittedModel)

testTemporalAutocorrelation(res, time =  testData$time)

I don't see any differences when switching between

  time = 1:sampleSize
  time = sample.int(sampleSize)

in the simulate function

florianhartig commented 1 month ago

So, to me it seems the simulation with unordered time actually works fine, in which case we should switch to this as otherwise we'll indeed get a spurious correlation with the REs.

melina-leite commented 1 month ago

I think the problem happens when we have random effects together with the temporal autocorrelation. I did some tests to calculate the proportion of significative p.values for the temporal autocorrelation test for the 4 cases of simulation (100 simulations each):

temporalAutocorrelation = 0 or = 20 (very high)
randomEffectVariance = 0 or = 0.5 (not so high)

Results for createData (originally ordered): temporal random.effect prop.sig 1 no no 4 2 yes no 100 3 no yes 66 4 yes yes 100

Results for `createdData2 (with the changes as presented above) temporal random.effect prop.sig 1 no no 9 2 yes no 100 3 no yes 4 4 yes yes 95

So the type I error for the original-ordered createData is higher when you don't have temporal autocorrelation but have some random effects, right?

I added the code of the simulations to the file 436.R.

florianhartig commented 1 month ago

Yes sure, we have to have the time unordered, else it's correlated with the RE. So we want to set

time = sample.int(sampleSize)

But once that is set, I didn't see any problems, so I don't understand why we thought that is a problem yesterday.

melina-leite commented 1 month ago

Yesterday, I had a problem when I was doing the tests for the functions simulating data without temporal autocorrelation but still getting temporal autocorrelation. Then, the tests failed in the expectation. I was not just time = sample.int(sampleSize) that I changed (see first comment). So, should I open a pull request with the changes in the function, and you review it? Or could I just commit to the main branch? Not sure what would be the best in this case of changing a very very useful and used function.

florianhartig / DHARMa

createData and temporalAutocorrelation while still having random effects variance #436