Just a guess on my part: demo <- m.out[["1"]] is using the label name "1" to identify a cluster of cells that is not the intended one (in 3.14-3.16). "1" was the name of the cluster that has strong expression of CD3D in some previous version. What changed to give the desired cluster a different name is the mystery. (It would be good to go over the other elements of m.out and find which ones have (or which one has) CD3D in the head of the Symbol field.)
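A minimal sketch of that check, assuming m.out behaves like a named list of marker tables with a Symbol column already sorted by rank. The toy m.out below is made up (the symbols are borrowed from the outputs in this thread), hasInHead() is a hypothetical helper, and a real m.out may need as.list() first:

```r
# Toy stand-in for m.out: a named list of per-cluster marker tables,
# each already ordered by rank. Hypothetical data, not the book's.
m.out <- list(
  "1"  = data.frame(Symbol = c("JUN", "S100A4", "LTB", "IL32", "RPS21")),
  "2"  = data.frame(Symbol = c("RPS10", "ANXA1", "FOS", "RPL35", "RPS27")),
  "10" = data.frame(Symbol = c("CD8B", "CD3D", "LDHB", "RPS29", "RPS21"))
)

# TRUE for every cluster whose first n marker symbols include the gene.
hasInHead <- function(markers, gene, n = 5) {
  vapply(markers, function(df) gene %in% head(df$Symbol, n), logical(1))
}

hits <- hasInHead(m.out, "CD3D")
names(hits)[hits]  # names of the clusters with CD3D near the top
```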
Exactly. This is what m.out[["1"]] used to look like (about 1 year ago):
demo <- m.out[["1"]]
as.data.frame(demo[1:10,c("Symbol", "Top", "p.value", "FDR")])
# Symbol Top p.value FDR
# ENSG00000172116 CD8B 1 7.352025e-100 8.831478e-97
# ENSG00000167286 CD3D 1 1.545028e-204 4.825433e-201
# ENSG00000111716 LDHB 1 5.445106e-146 9.447864e-143
# ENSG00000213741 RPS29 1 0.000000e+00 0.000000e+00
# ENSG00000171858 RPS21 1 0.000000e+00 0.000000e+00
# ENSG00000171223 JUNB 1 8.879883e-235 4.622275e-231
# ENSG00000177954 RPS27 2 1.045283e-296 6.529257e-293
# ENSG00000153563 CD8A 2 1.000000e+00 1.000000e+00
# ENSG00000136942 RPL35 2 0.000000e+00 0.000000e+00
# ENSG00000198851 CD3E 2 6.772585e-174 1.410143e-170
This is what we actually see in OSCA.multisample 1.1.0, the last successful build of OSCA.multisample (see "3.3 After blocking on the batch" section here).
But at some point something changed and today's m.out[["1"]] looks like this:
demo <- m.out[["1"]]
as.data.frame(demo[1:10,c("Symbol", "Top", "p.value", "FDR")])
# Symbol Top p.value FDR
# ENSG00000177606 JUN 1 3.830494e-203 1.495425e-199
# ENSG00000196154 S100A4 1 3.113656e-260 1.389225e-256
# ENSG00000227507 LTB 1 0.000000e+00 0.000000e+00
# ENSG00000008517 IL32 1 0.000000e+00 0.000000e+00
# ENSG00000171858 RPS21 1 0.000000e+00 0.000000e+00
# ENSG00000124614 RPS10 2 0.000000e+00 0.000000e+00
# ENSG00000135046 ANXA1 2 1.062389e-56 1.276174e-53
# ENSG00000167286 CD3D 2 1.883686e-277 9.805215e-274
# ENSG00000111716 LDHB 2 1.080598e-168 2.812436e-165
# ENSG00000170345 FOS 2 1.861859e-202 6.461063e-199
A very different cluster! (CD3D is still there, but it is no longer at the head.)
And the old m.out[["1"]] can be seen in today's m.out[["10"]]:
demo <- m.out[["10"]]
as.data.frame(demo[1:10,c("Symbol", "Top", "p.value", "FDR")])
# Symbol Top p.value FDR
# ENSG00000172116 CD8B 1 6.165199e-101 7.405827e-98
# ENSG00000167286 CD3D 1 5.348343e-205 1.855994e-201
# ENSG00000111716 LDHB 1 4.784764e-149 8.790456e-146
# ENSG00000213741 RPS29 1 0.000000e+00 0.000000e+00
# ENSG00000171858 RPS21 1 0.000000e+00 0.000000e+00
# ENSG00000171223 JUNB 1 1.023935e-233 4.568506e-230
# ENSG00000177954 RPS27 2 2.482226e-294 1.550498e-290
# ENSG00000153563 CD8A 2 1.000000e+00 1.000000e+00
# ENSG00000124614 RPS10 2 0.000000e+00 0.000000e+00
# ENSG00000198851 CD3E 2 1.518088e-173 3.160861e-170
Looks like the clustering is taken care of by igraph::cluster_walktrap and/or igraph::cluster_louvain and it seems that these functions have changed in recent versions of igraph. If I downgrade igraph to version 1.2.6 (this is the version that was used to build OSCA.multisample 1.1.0), then I get an m.out[["1"]] that looks like the old m.out[["1"]] again.
So what should we do? Replace m.out[["1"]] with m.out[["10"]] or replace the CD3D/CD8B pair with the S100A4/JUN pair?
@LTLA ?
H.
Nice work. I think it would be appropriate to replace "1" by "10" at this point. But beyond this I think it would be a good idea to come up with labels that have semantic value. I don't have a general approach, but a step in this direction would be to take a digest of the top 10 gene symbols in each cluster, for example. Use those digests as the names of the components of m.out. The clustering procedure might permute the labels "1", ..., "10" for whatever reason, but we would superimpose labels that are based on the genes, and interrogate the list using digest-based labels. The digest-based label might also map to a "cell type". This should work if the only change between versions concerns permutation of the labels. If in addition the clustering/analysis procedure does not preserve the ordered gene lists, the digests in the previous version won't be present and an error would be thrown. This could obviate the need for the stopifnot() clauses.
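A rough sketch of the gene-based labelling idea, under loud assumptions: the "digest" here is simply the top symbols pasted together (a real hash, e.g. from the digest package, could be swapped in), the marker lists are toy data, and geneLabel(), relabel() and pick() are hypothetical helpers. Note that [[ on a plain list returns NULL for a missing name, so the lookup checks explicitly in order to get the error described above.

```r
# Toy marker lists from two "versions" of the clustering: same clusters,
# permuted numeric labels. Hypothetical data.
old <- list("1" = c("CD8B", "CD3D", "LDHB"), "2" = c("JUN", "S100A4", "LTB"))
new <- list("1" = c("JUN", "S100A4", "LTB"), "2" = c("CD8B", "CD3D", "LDHB"))

# Stand-in "digest": the top n symbols pasted together.
geneLabel <- function(symbols, n = 3) paste(head(symbols, n), collapse = "|")

# Rename the list components by their gene-based label.
relabel <- function(markers, n = 3) {
  labels <- vapply(markers, geneLabel, character(1), n = n)
  stopifnot(!anyDuplicated(labels))  # labels must identify clusters uniquely
  setNames(markers, labels)
}

# Look up a cluster by label, failing loudly if the label is gone.
pick <- function(markers, key) {
  if (!key %in% names(markers)) stop("no cluster labelled '", key, "'")
  markers[[key]]
}

old.by.gene <- relabel(old)
new.by.gene <- relabel(new)

key <- "CD8B|CD3D|LDHB"
identical(pick(old.by.gene, key), pick(new.by.gene, key))
```

If the ordered gene list itself changes between versions, the key no longer exists and pick() stops, which is the failure mode one would want to see.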
Yes, replacing 1->10 would be the immediate fix.
The general solution would be what Vince described, and is just something that we have to accommodate when dependencies inevitably change from under us. (In this case, igraph's updates.)
A possibly elegant solution would be to implement a chooseCluster() function somewhere that chooses a cluster based on its top markers. We could then replace all hard-coded cluster numbers with the output of chooseCluster(), run on the marker genes. Similarly, the inline text could use the output of chooseCluster() to make it look like we knew which cluster we wanted all along.
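One way chooseCluster() could look, purely as a sketch: no such function exists in the book as far as this thread shows, and the signature, the scoring rule (count of requested genes among the top n symbols) and the toy m.out are all assumptions.

```r
# Toy marker tables; symbols borrowed from the outputs quoted in this thread.
m.out <- list(
  "1"  = data.frame(Symbol = c("JUN", "S100A4", "LTB", "IL32", "RPS21",
                               "RPS10", "ANXA1", "CD3D", "LDHB", "FOS")),
  "10" = data.frame(Symbol = c("CD8B", "CD3D", "LDHB", "RPS29", "RPS21",
                               "JUNB", "RPS27", "CD8A", "RPS10", "CD3E"))
)

# Hypothetical helper: return the name of the single cluster whose top n
# markers contain the most of the requested genes; stop if there is no
# match or a tie.
chooseCluster <- function(markers, genes, n = 10) {
  score <- vapply(markers,
                  function(df) sum(genes %in% head(df$Symbol, n)),
                  integer(1))
  best <- names(score)[score == max(score)]
  if (max(score) == 0 || length(best) != 1) {
    stop("could not pick a unique cluster for: ", paste(genes, collapse = ", "))
  }
  best
}

chosen <- chooseCluster(m.out, c("CD3D", "CD8B"))
demo <- m.out[[chosen]]
```

Stopping on ties or on zero matches keeps the behaviour of the existing sanity checks: a silent wrong pick would be worse than a failed build.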
There are still some stop()s unrelated to marker genes, though those should be much less prominent.
Definitely a change in the ordering of the clusters returned by igraph::cluster_louvain(). With the following graph:
library(igraph)
g <- make_full_graph(4) %du% make_full_graph(2) %du% make_full_graph(8) %du% make_full_graph(5)
g <- add_edges(g, c(6, 15))
igraph 1.2.6 produces:
igraph::cluster_louvain(g)
# IGRAPH clustering multi level, groups: 3, mod: 0.54
# + groups:
# $`1`
# [1] 1 2 3 4
#
# $`2`
# [1] 7 8 9 10 11 12 13 14
#
# $`3`
# [1] 5 6 15 16 17 18 19
igraph 1.3.1 produces:
igraph::cluster_louvain(g)
# IGRAPH clustering multi level, groups: 3, mod: 0.54
# + groups:
# $`1`
# [1] 1 2 3 4
#
# $`2`
# [1] 5 6 15 16 17 18 19
#
# $`3`
# [1] 7 8 9 10 11 12 13 14
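To confirm that only the labels moved, the two results can be cross-tabulated. The sketch below uses base R only; the membership vectors are transcribed by hand from the two outputs above rather than computed, so it runs without either igraph version installed.

```r
# Membership of nodes 1..19, transcribed from the igraph 1.2.6 output.
old <- integer(19)
old[1:4] <- 1L; old[7:14] <- 2L; old[c(5, 6, 15:19)] <- 3L

# Same, transcribed from the igraph 1.3.1 output.
new <- integer(19)
new[1:4] <- 1L; new[c(5, 6, 15:19)] <- 2L; new[7:14] <- 3L

# Rows are old labels, columns are new labels, cells count shared nodes.
tab <- table(old = old, new = new)

# For each old label, the new label with the largest overlap.
mapping <- setNames(colnames(tab)[apply(tab, 1, which.max)], rownames(tab))
mapping

# The partitions are identical up to relabelling if every row and every
# column of the table has exactly one non-zero cell.
samePartition <- all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
samePartition
```

When clusters differ slightly between versions, samePartition would be FALSE, but the mapping still gives the best-overlapping new label for each old one.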
Also, on graphs with thousands of nodes (like in the OSCA.multisample book), igraph::cluster_walktrap() can produce clusters that differ slightly between the 2 versions of igraph:
g <- buildSNNGraph(all.sce[[n]], k=10, use.dimred='PCA') # graph with 14948 nodes and 1612016 edges
mbship <- igraph::cluster_walktrap(g)$membership
igraph 1.2.6 produces:
sort(table(mbship))
# mbship
# 6 10 7 14 8 11 12 3 1 2 4 13 5 9
# 23 107 155 225 331 542 623 1100 1325 1708 1852 2040 2194 2723
igraph 1.3.1 produces:
sort(table(mbship))
# mbship
# 13 14 9 12 11 3 4 5 10 1 6 2 7 8
# 23 103 155 225 359 525 623 1096 1324 1707 1855 2040 2197 2716
For example, cluster "1" in 1.2.6 became cluster "10" in 1.3.1 and lost one node.
Anyways, I'll do the 1->10 replacement in the book.
Thanks @vjcitn @LTLA for the feedback!
H.
https://bioconductor.org/checkResults/3.14/books-LATEST/OSCA.multisample/nebbiolo2-buildsrc.html
https://bioconductor.org/checkResults/3.15/books-LATEST/OSCA.multisample/nebbiolo1-buildsrc.html
https://bioconductor.org/checkResults/3.16/books-LATEST/OSCA.multisample/nebbiolo2-buildsrc.html
I've started to look into this. Using this issue as a convenient way to share my progress.
First milestone: I managed to extract the code from tenx-filtered-pbmc3k-4k-8k.Rmd and using-corrected-values.Rmd that leads to this error. This allows me to reproduce the error in about 5 min. on my laptop (Ubuntu 22.04 LTS, 16 GB of RAM):
sessionInfo():