wrong chisq reported - Githubissues

FertigLab / CoGAPS

Bayesian MCMC matrix factorization algorithm

https://www.bioconductor.org/packages/release/bioc/html/CoGAPS.html

BSD 3-Clause "New" or "Revised" License

66 stars 17 forks source link

wrong chisq reported #116

Closed dimalvovs closed 4 weeks ago

dimalvovs commented 1 month ago

It has been reported that mean Chi-square output by CoGAPS differs from the manually calculated one.

quoting the report:

> deathGAPS <- CoGAPS(rawsubT,uncertainty=smat,nPatterns=5,nIterations=5000,outputFrequency=500)
> getMeanChiSq(deathGAPS)
[1] 13009637
> deathA <- getAmplitudeMatrix(deathGAPS)
> deathP <- getPatternMatrix(deathGAPS)
> deathM <- rawsubT - deathA%*%t(deathP)
> sum((deathM/smat)^2)
[1] 13042899

let's find out the source of this error and correct it.

dimalvovs commented 1 month ago

a test introduced in dbd39fe085728efb6f804a4c702f9c635e28ab24, it's results are as follows before the fix:

Error: `reported` not equal to `calculated`.
1/1 mismatches
[1] 4075 - 4075 == -0.00547
> 0.00547/4075
[1] 1.342331e-06
> reported
[1] 4074.758
> calculated
[1] 4074.763

dimalvovs commented 4 weeks ago

Changing a lot of variables to float (see c6166cd678669d84f057529cc4df4467969c141f) allows to obtain an almost 40x smaller diff of 4426 - 4426 == -0.000145. Notably the actual value grew by 8.6%.

dimalvovs commented 4 weeks ago

Went further down to [1] "difference: 0.00006303062400548" in 0594745863594b206b64355a646583de6bf62ef0

dimalvovs commented 4 weeks ago

Reverted prev. changes and another take on bringing the values closer in 6a20d7b9a4e3615832ac7c4905e2768fa45cc911. It worked a bit worse:

> data(GIST)
> data <- GIST.data_frame
> unc <- 0.1*as.matrix(data)
> res <- CoGAPS(data, nIterations=1000, uncertainty = unc,
+                   seed=1, messages=FALSE, sparseOptimization=FALSE)

This is CoGAPS version 3.25.1 
Running Standard CoGAPS on data (1363 genes and 9 samples)
6276.828147
6087.639273
> reported <- getMeanChiSq(res)
> A <- getAmplitudeMatrix(res)
> P <- getPatternMatrix(res)
> M <- A %*% t(P)
> calculated <- sum(((data - M)/unc)^2)
> #Set the tolerance to float precision times matrix size
>     sprintf("difference: %.17f", abs(reported - calculated))
[1] "difference: 0.00547298026140197"
> reported; calculated
[1] 4074.758
[1] 4074.763

dimalvovs commented 4 weeks ago

All in all it does look that the difference is attributed to double/float conversions and we were able to bring it down 86 times to 0.00006303062400548 by changing various variables to double form float as shown here. Reverting the latest change, leaving only the test and closing.