Open AndiMunteanu opened 2 years ago
Thanks for this excellent issue!
I agree that the storage format should not alter the results beyond usual floating point limits. I'm investigating this example carefully and trying to get to the bottom of this.
My 2c: this doesn't seem particularly unusual for an iterative algorithm if there are differences in numerical precision between the sparse matrix multiplication operator (based on CHOLMOD, IIRC) and its dense counterpart (BLAS's `dgemm`). From experience with other algorithms - namely the C++ code in Rtsne - I've noticed that very minor changes in precision - e.g., flipping the least significant bit in a double-precision value - can happily propagate into very large differences in the final result.
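As a small illustration of why two multiplication routines can disagree in the last bits, here is a minimal base R sketch. It only shows that floating-point addition is not associative, so two code paths that accumulate the same products in a different order may round differently; it does not reproduce what CHOLMOD or `dgemm` actually do internally.

```r
# Same three numbers, summed in two different orders.
a <- (0.1 + 0.2) + 0.3
b <- 0.1 + (0.2 + 0.3)

# Exact comparison fails because the intermediate roundings differ.
exactly_equal <- a == b

# Comparison up to a small tolerance succeeds.
nearly_equal <- isTRUE(all.equal(a, b))

print(exactly_equal)   # FALSE
print(nearly_equal)    # TRUE
print(abs(a - b))      # on the order of .Machine$double.eps
```

In an iterative method, a discrepancy of this size in one matrix-vector product is fed back into the next iteration, which is how it can grow beyond the last bit.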
I've been using the `irlba` package on the same input stored both as a dense and as a sparse matrix, and I noticed that the PCA output is influenced by the matrix storage format. Here is an example to illustrate this point. I ran `irlba` on the `pbmc_small` dataset (a toy dataset, part of the `Seurat` package).

The dense and the sparse objects stem from the same initial matrix. The seed was set to the same value (2016); the other parameters, `nv` and `tol`, were set to the same values for both instances (50 and 1e-5).

If we subtract the absolute values of `dense_embedding` and `sparse_embedding`, we get a maximum value of `1.902228e-08` (the code I've used for this was `max(abs(abs(dense_embedding) - abs(sparse_embedding)))`). I also plotted the difference distributions between the two embeddings across each Principal Component.

Although I do not consider 2e-08 a negligible value, working with larger datasets results in even higher differences (the following plot was created on a dataset with 1880 points and 31832 features, where the maximum of the differences between the PCAs was 0.0014). The dotted line indicates the value of the tolerance parameter, set by default to 1e-05.
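Taking `abs()` of both embeddings removes the sign entry by entry. A possible alternative, sketched below, is to flip whole columns so that matching components point the same way and then report the largest difference per component. The helper name `per_component_diff` and the toy matrices are hypothetical and stand in for `dense_embedding` and `sparse_embedding`; this is not the code used for the numbers above.

```r
# Hypothetical helper: align the sign of each column of `b` with the matching
# column of `a`, then return the largest absolute difference per column.
per_component_diff <- function(a, b) {
  stopifnot(identical(dim(a), dim(b)))
  # +1 or -1 depending on the inner product of matching columns.
  signs <- sign(colSums(a * b))
  signs[signs == 0] <- 1
  b_aligned <- sweep(b, 2, signs, "*")
  apply(abs(a - b_aligned), 2, max)
}

# Toy data (assumption): a small matrix and a perturbed copy of it.
set.seed(2016)
dense_toy <- matrix(rnorm(20), nrow = 5, ncol = 4)
sparse_toy <- dense_toy
sparse_toy[, 2] <- -sparse_toy[, 2]            # simulate an arbitrary sign flip
sparse_toy[1, 3] <- sparse_toy[1, 3] + 1e-8    # simulate a small numerical difference

diffs <- per_component_diff(dense_toy, sparse_toy)
print(diffs)       # one value per component
print(max(diffs))  # overall maximum, comparable to the single number above
```

Reporting one value per component also shows whether the discrepancy is concentrated in the trailing components, which are typically the least well converged.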
I am happy to share the code used for generating this plot if required.
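For anyone who wants to try the comparison without the original data, a minimal sketch could look like the following. It assumes the `Matrix` and `irlba` packages are installed and uses a small random sparse matrix instead of `pbmc_small`; the matrix size and `nv = 5` are arbitrary choices for illustration, not the settings reported above.

```r
library(Matrix)
library(irlba)

# Random sparse matrix (assumption: stands in for the real expression matrix).
set.seed(1)
sparse_mat <- rsparsematrix(200, 50, density = 0.1)
dense_mat <- as.matrix(sparse_mat)   # same values, dense storage

# Reset the seed before each call so both runs use the same starting vector.
set.seed(2016)
dense_fit <- irlba(dense_mat, nv = 5, tol = 1e-5)
set.seed(2016)
sparse_fit <- irlba(sparse_mat, nv = 5, tol = 1e-5)

# Largest difference between the singular values of the two runs.
print(max(abs(dense_fit$d - sparse_fit$d)))

# Largest difference between right singular vectors, ignoring sign entry by entry.
print(max(abs(abs(dense_fit$v) - abs(sparse_fit$v))))
```

The exact values printed depend on the installed BLAS and package versions, so they are not listed here.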
My question is: should changing the matrix storage format affect the `irlba` results in the way seen above? Below is the `sessionInfo()` output: