florianhartig / DHARMa

Diagnostics for HierArchical Regession Models
http://florianhartig.github.io/DHARMa/
201 stars 21 forks source link

Binomial k/n model fit with zero weights in the data --> Simulation from the model produced wrong dimension #325

Closed 0avasns closed 2 years ago

0avasns commented 2 years ago

(pasting R history from Console below; note DHARMa reports missing mgcViz although the package is available and indeed loaded)


> library(mgcv)
Loading required package: nlme
This is mgcv 1.8-38. For overview type 'help("mgcv-package")'.
> library(mgcViz)
Loading required package: qgam
Loading required package: ggplot2
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
Registered S3 method overwritten by 'mgcViz':
  method from  
  +.gg   GGally

Attaching package: ‘mgcViz’

The following objects are masked from ‘package:stats’:

    qqline, qqnorm, qqplot

> library(DHARMa)
This is DHARMa 0.4.5. For overview type '?DHARMa'. For recent changes, type news(package = 'DHARMa')
> 
> load("model_newm4b.rda")
> summary(newm4b)

Family: binomial 
Link function: logit 

Formula:
TargetLooks ~ str_type + s(TRIAL_INDEX) + s(Time) + s(Time, by = str_type) + 
    s(Time, Subject, bs = "fs", m = 1) + s(Time, Item, 
    bs = "fs", m = 1)

Parametric coefficients:
                   Estimate Std. Error z value Pr(>|z|)   
(Intercept)          0.8199     0.2673   3.067  0.00216 **
str_typefirstthird   0.3294     0.3191   1.032  0.30196   
str_typesecfirst     0.1474     0.3200   0.461  0.64506   
str_typesecthird     0.3808     0.3062   1.244  0.21356   
str_typethirdfirst   0.3761     0.2996   1.255  0.20942   
str_typethirdsec     0.2895     0.3157   0.917  0.35914   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                               edf  Ref.df    Chi.sq p-value    
s(TRIAL_INDEX)               8.968   9.000  2542.274  <2e-16 ***
s(Time)                      5.346   5.652    49.779  <2e-16 ***
s(Time):str_typefirstsec     5.077   5.428     2.233   0.858    
s(Time):str_typefirstthird   5.113   5.433     6.304   0.310    
s(Time):str_typesecfirst     7.579   7.778     5.593   0.669    
s(Time):str_typesecthird     1.000   1.001     0.229   0.632    
s(Time):str_typethirdfirst   1.001   1.001     0.408   0.524    
s(Time):str_typethirdsec     3.201   3.504     7.267   0.101    
s(Time,Subject)            318.611 332.000 63911.473  <2e-16 ***
s(Time,Item)               448.166 489.000 56193.651  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Rank: 905/906
R-sq.(adj) =  0.191   Deviance explained = 18.2%
fREML = 7.971e+05  Scale est. = 1         n = 99860
> 
> simnewm4b <- simulateResiduals(fittedModel=newm4b, plot=F)
It seems you don't have mgcViz installed on this computer. When using DHARMa with mgcv objects, it is highly recommended to also install mgcViz, which will extend the ability of DHARMa to simulate from various gam objects. Withoug mgcViz, simulations will fail in various situations. See vignette for details!
Error in securityAssertion("Simulation from the model produced wrong dimension",  : 
  Message from DHARMa: During the execution of a DHARMa function, some unexpected conditions occurred. Even if you didn't get an error, your results may not be reliable. Please check with the help if you use the functions as intended. If you think that the error is not on your side, I would be grateful if you could report the problem at https://github.com/florianhartig/DHARMa/issues 

 Context: Simulation from the model produced wrong dimension
> plot(simnewm4b)
Error in plot(simnewm4b) : object 'simnewm4b' not found
> 
> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] DHARMa_0.4.5  mgcViz_0.1.9  ggplot2_3.3.5 qgam_1.3.4    mgcv_1.8-38   nlme_3.1-153 
``
florianhartig commented 2 years ago

Hi Athanassios,

thanks for reporting this, yes, that's weird.

Could you try running the command: "mgcViz" %in% installed.packages() on your system?

I think you wrote to me by email that you use several computers with different OS, is this happening on all systems or only under Windows?

Cheers, Florian

0avasns commented 2 years ago

"mgvViz" is not in installed.packages() on my two Windows systems but it is there on the Linux system. Accordingly, I get the warning about mgcViz on Windows but not on Linux. Not sure how installed.packages() works, but it seems to be missing plenty of installed (and well-functioning) packages; perhaps it has to do with them being on a remote network drive (so that our $HOME is mounted across institutional machines regardless of access point). So if you are relying on installed.packages perhaps you aren't seeing stuff that's not local.

Anyway this warning does not seem to have a catastrophic effect as DHARMa 0.4.5 has worked fine for other models (both binomial and gaussian) on the same systems despite the mgcViz warning. It seems to be this particular model that leads to the failed assertion. I can share the offending model if it can help (its summary can be seen here).

florianhartig commented 2 years ago

OK, thanks, just to be sure: the script above ran completely on your Windows system, so library(mgcViz) executes on your Windows systems without problem, despite mgcViz not appearing in the output of installed.packages()?

If that is so, the problem is clearly that DHARMa doesn't recognise that you have mgcViz installed. In this case, simulateResiduals will fall back on the simulate function of mgcv, which doesn't work in all cases.

If it's not too much work, could you have a look if command find.package("mgcViz") returns a valid path on your Windows machine?

0avasns commented 2 years ago

Yes, and yes. library(mgcViz) executes without problem despite mgcViz not appearing in installed.packages(). And find.package() finds it. The following snippet from R console is right after an R restart:

> library(mgcViz)
Loading required package: mgcv
Loading required package: nlme
This is mgcv 1.8-38. For overview type 'help("mgcv-package")'.
Loading required package: qgam
Loading required package: ggplot2
Learn more about the underlying theory at https://ggplot2-book.org/
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
Registered S3 method overwritten by 'mgcViz':
  method from  
  +.gg   GGally

Attaching package: ‘mgcViz’

The following objects are masked from ‘package:stats’:

    qqline, qqnorm, qqplot

> "mgcViz" %in% installed.packages("mgcViz")
[1] FALSE
> find.package("mgcViz")
[1] "M:/pc/Dokumenter/R/win-library/4.1/mgcViz"
>

However, I doubt that this is the whole story, because the same problem with the failed assertion occurs on the Linux cluster, where mgcViz is detected without problem:

> library(mgcv)
Loading required package: nlme
This is mgcv 1.8-38. For overview type 'help("mgcv-package")'.
> library(mgcViz)
Loading required package: qgam
Loading required package: ggplot2
Registered S3 method overwritten by 'GGally':
  method from
  +.gg   ggplot2
Registered S3 method overwritten by 'mgcViz':
  method from
  +.gg   GGally

Attaching package: ‘mgcViz’

The following objects are masked from ‘package:stats’:

    qqline, qqnorm, qqplot

> library(DHARMa)
This is DHARMa 0.4.5. For overview type '?DHARMa'. For recent changes, type news(package = 'DHARMa')
> "mgcViz" %in% installed.packages()
[1] TRUE
> load("model_newm4b.rda")
> simnewm4b <- simulateResiduals(fittedModel=newm4b, plot=F)
Error in securityAssertion("Simulation from the model produced wrong dimension",  :
  Message from DHARMa: During the execution of a DHARMa function, some unexpected conditions occurred. Even if you didn't get an error, your results may not be reliable. Please check with the help if you use the functions as intended. If you think that the error is not on your side, I would be grateful if you could report the problem at https://github.com/florianhartig/DHARMa/issues

 Context: Simulation from the model produced wrong dimension
> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: aarch64-unknown-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.5 (Ootpa)

Matrix products: default
BLAS/LAPACK: /opt/software/easybuild/software/OpenBLAS/0.3.12-GCC-10.2.0/lib/libopenblas_cortexa72p-r0.3.12.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] DHARMa_0.4.5  mgcViz_0.1.9  ggplot2_3.3.5 qgam_1.3.4    mgcv_1.8-38
[6] nlme_3.1-153

I have sent you the fitted model in case it can help debug.

florianhartig commented 2 years ago

OK, so the package problem is solved with https://github.com/florianhartig/DHARMa/commit/a50f0a81a48f2cdbb0842c386c58702e19039ffc I hope.

You are right, the dimension problem has a different reason indeed, the error occurs because for your model, nobs(newm4b) produces a different number of dimensions than what is actually present in your model.frame.

> nobs(newm4b)
[1] 75229
> newm4b$df.null
[1] 99860
> nrow(newm4b$model)
[1] 99860

This is happening because you have a considerable number of rows in the model frame where you effectively have zero observations, e.g.

image

So you are making binomial trials without trials here.

I could overwrite the error, but I'm not sure if this makes much sense, you have zero observations, and all your simulations will be zero as well, if you calculate residuals for these "ghost observations" you will basically add random noise to your plots that is not based on data.

I don't think it's worth to implement code to automatically handle this, as this seems to be a rather exotic case and I think you would be better off re-fitting them model with a minimal data set. I will therefore implement a more explicit warning about this.

Or do you have a particular reason for wanting to include all data in the fit?

florianhartig commented 2 years ago

I have added a more specific warning about this! Many thanks for reporting this, these kind of issues are very helpful!

0avasns commented 2 years ago

Thanks a lot for solving this. A warning sounds like the most appropriate solution to me. It was not intentional to have zero-observation rows. This is an older model I wanted to check posthumously :-) so I'll have to dig up the cause of this situation; and be more careful in the future.