This problem can be related to the threshold, i.e. the number of genes we include in the enrichment. I have noticed "interpretation" differences for some datasets depending on how many "top" genes from the ICs were included for enrichment. Some sort of sensitivity study is necessary.
So I have run the enrichment for the BRCA_TCGA data, selecting genes above the 0.98 quantile and increasing the threshold in steps of 0.001 (0.1%), i.e. going from 420 down to 21 genes in steps of 21 genes. The results didn't change at all for the Myeloid cell component, but they definitely changed for T cells / B cells. What I observed is that for lower thresholds (more genes) T cells are the most enriched up to quantile 0.995, and then it switches to B cells with very high probability (8/12 genes). So what I see is that the top driving genes are B cell genes (some of them highly B cell specific).
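For reference, a minimal sketch of such a sweep in R (assumptions: `ic_weights` is a named numeric vector of gene weights for one IC, `markers` and `universe` are hypothetical gene lists, and a plain Fisher test stands in for whatever enrichment test we actually use):

```r
# Sweep the quantile threshold and re-run a simple over-representation test
# at each cutoff (assumed inputs, not the package code).
enrich_at_quantiles <- function(ic_weights, markers, universe,
                                quantiles = seq(0.98, 0.999, by = 0.001)) {
  sapply(quantiles, function(q) {
    top_genes <- names(ic_weights)[ic_weights > quantile(ic_weights, q)]
    # 2x2 contingency table: top vs non-top genes, marker vs non-marker genes
    a <- sum(top_genes %in% markers)
    b <- length(top_genes) - a
    c <- sum(universe %in% markers) - a
    d <- length(universe) - a - b - c
    fisher.test(matrix(c(a, b, c, d), nrow = 2))$p.value
  })
}

# Hypothetical usage:
# pvals <- enrich_at_quantiles(ic_weights, bcell_markers, universe = names(ic_weights))
# plot(seq(0.98, 0.999, by = 0.001), -log10(pvals), type = "b")
```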
For 3 decompositions of CIT
Maybe we should add some kind of stabilization, i.e. decompose several times and keep only the stable components?
I increased maxit and decreased tol, which improved the reproducibility. We can also cheat and fix the seed inside run_fastica.
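Concretely, a minimal sketch with the plain fastICA package (here `X` stands for the centred expression matrix; the argument names inside run_fastica may differ):

```r
library(fastICA)

# Two independent runs with a tight tolerance and a high iteration cap,
# so that they converge to closer solutions.
set.seed(1)
res1 <- fastICA(X, n.comp = 20, alg.typ = "parallel", method = "C",
                maxit = 1000, tol = 1e-9)
set.seed(2)
res2 <- fastICA(X, n.comp = 20, alg.typ = "parallel", method = "C",
                maxit = 1000, tol = 1e-9)

# The "cheat": using the same seed for both runs would make them identical,
# which hides (rather than solves) the instability.
```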
I then always get 4 components that pass the correlation threshold. However, the interpretation still differs slightly (less than before). I will try to decrease tol and increase maxit even more.
We can see that the lowest correlation between runs concerns the 4th component; however, even a small change has consequences for the enrichment test.
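The between-run check is essentially this (a sketch, assuming `res1` and `res2` are two fastICA runs as in the snippet above):

```r
# Match each component of run 1 to its best-correlated component of run 2;
# the components with the lowest best-match correlation are the unstable ones.
cc   <- abs(cor(res1$S, res2$S))   # n.comp x n.comp absolute correlations
best <- apply(cc, 1, max)          # best match for each component of run 1
round(sort(best), 3)               # weakest matches first
```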
So this can mainly be settled by ICASSO stabilization.
It works efficiently in MATLAB. This is why the unofficial version of the package will include the possibility to call the MATLAB ICA with ICASSO.
I also used Biton's MineICA::clusterFastICARuns() function; however, I had problems with the MineICA installation as it depends on too many packages... therefore I copied and adapted the function.
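The core idea is roughly the following (a simplified sketch of the ICASSO approach, not the MineICA code; `X` is again the centred expression matrix and the parameter values are illustrative):

```r
library(fastICA)

# Run fastICA nbIt times, pool all estimated components, cluster them on
# 1 - |correlation| with average-linkage hclust, and keep one representative
# component per tight cluster.
icasso_sketch <- function(X, n.comp = 10, nbIt = 20, h = 0.2) {
  runs  <- lapply(seq_len(nbIt), function(i) fastICA(X, n.comp = n.comp, method = "C")$S)
  S_all <- do.call(cbind, runs)                 # genes x (n.comp * nbIt)
  d     <- as.dist(1 - abs(cor(S_all)))         # dissimilarity between components
  cl    <- cutree(hclust(d, method = "average"), h = h)
  # representative = the component closest on average to the rest of its cluster
  sapply(unique(cl), function(k) {
    idx <- which(cl == k)
    if (length(idx) == 1) return(idx)
    sub <- as.matrix(1 - abs(cor(S_all[, idx])))
    idx[which.min(rowMeans(sub))]
  })
}
```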
Testing now how slow it is ...
Opening another issue for the enrichment test
res.test.2 <- run_fastica(
  METABRIC.cen,
  optimal = TRUE, row.center = TRUE, with.names = FALSE,
  alg.typ = "parallel", gene.names = row.names(METABRIC.cen),
  method = "C", n.comp = 100, isLog = TRUE, R = TRUE,
  stabilize = TRUE, funClus = "hclust", methodClust = "average", nbIt = 100
)
Time difference of 5.040406 hours
These stabilization results are really different from the MATLAB ones and from what we would expect.
We can see that the results of the MATLAB and R ICASSO are not the same as far as the partitions are concerned. The weird thing is that R seems to overcluster the stable components, which distorts the results...
I tried to figure it out. It looks like once I give the distance matrix to the R code it works fine, but when I put the distance matrix from R into MATLAB it works fine too. I also tried a different R implementation, but in practice it didn't work well either. I'm calling it a day; we will recommend using MATLAB or Docker.
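The cross-check I did was essentially this (file names are hypothetical; `S_all` is the pooled component matrix as in the sketch above): export the R-side dissimilarities, load the ones exported from MATLAB, and compare them before the clustering step, to see whether the divergence comes from the distances or from the clustering itself.

```r
# Compare the R and MATLAB dissimilarity matrices (hypothetical file names).
d_r <- 1 - abs(cor(S_all))
write.table(d_r, "dissim_R.csv", sep = ",", row.names = FALSE, col.names = FALSE)

d_matlab <- as.matrix(read.csv("dissim_matlab.csv", header = FALSE))
max(abs(d_r - d_matlab))   # close to 0 means the distance matrices agree
```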
Sometimes we get a different number of immune component candidates (the stroma ones don't always pass the threshold) - one possibility: don't take them into account for the deconvolution.