SCA-IRCM / SingleCellSignalR

R package for Bioconductor submission

Error in checkForRemoteErrors(val) : 3 nodes produced errors; first error: subscript out of bounds #12

Open FerdinandoPucci opened 3 years ago

FerdinandoPucci commented 3 years ago

Clustering stops after several hours of computation with that error. The command was:

> ND_clusters <- clustering(ND_merged, n=30)
Estimating the number of clusters
Error in checkForRemoteErrors(val) : 
  3 nodes produced errors; first error: subscript out of bounds

> str(ND_merged)
'data.frame':   21627 obs. of  11636 variables:
...
> ND_merged[1:5,1:5]
                     V2       V3 V4 V5 V6
0610005C13Rik        NA       NA NA NA NA
0610007P14Rik 0.5108256 0.268264  0  0  0
0610009B22Rik 0.0000000 0.000000  0  0  0
0610009L18Rik 0.0000000 0.000000  0  0  0
0610009O20Rik 0.0000000 0.000000  0  0  0

Thanks for any advice

SCA-IRCM commented 3 years ago

Hello,

First you can try this before re-launching the clustering function:

Feel free to contact me again if it doesn't work.

Thanks for using SingleCellSignalR!

SCA

FerdinandoPucci commented 3 years ago

Thanks SCA!

I removed the NAs with ND_merged[is.na(ND_merged)] <- 0, and there are no zero-filled lines:

> sum(apply(ND_merged_norm, 1, sum)==0)
[1] 0

data_prepare() generated ND_merged_norm from ND_merged.

However:

> ND_clusters <- clustering(ND_merged_norm, n=10)
Estimating the number of clusters
Error in checkForRemoteErrors(val) : 
  3 nodes produced errors; first error: vector memory exhausted (limit reached?)
In addition: Warning messages:
1: In for (i in 1L:d2) { : closing unused connection 5 (<-localhost:11977)
2: In for (i in 1L:d2) { : closing unused connection 4 (<-localhost:11977)
3: In for (i in 1L:d2) { : closing unused connection 3 (<-localhost:11977)

Even if:

> mem.maxVSize()
[1] 131072

Is it normal that clustering() requires so much RAM/swap? Thanks
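
For readers following along, the cleanup described in this comment can be sketched in base R. This is only an illustration on a toy table: the object name expr and its values are hypothetical and stand in for ND_merged, and dropping all-zero genes is an assumption about what a sensible pre-check looks like, not something the package is confirmed to require.

```r
# Toy expression table: genes in rows, cells in columns (hypothetical values).
expr <- data.frame(
  V2 = c(NA, 0.51, 0, 0),
  V3 = c(NA, 0.27, 0, 1.2),
  V4 = c(NA, 0, 0, 0),
  row.names = c("geneA", "geneB", "geneC", "geneD")
)

# Replace missing values with zero, as done in the thread.
expr[is.na(expr)] <- 0

# Flag genes whose expression is zero in every cell and count them.
zero_rows <- rowSums(expr) == 0
n_zero <- sum(zero_rows)

# Keep only genes with at least one non-zero value.
expr_clean <- expr[!zero_rows, , drop = FALSE]
```

Note that a gene that was NA in every cell (like geneA here) becomes an all-zero row once the NAs are replaced, so the zero-row check is worth running after the replacement rather than before.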

SCA-IRCM commented 3 years ago

Hi,

The clustering() function uses SIMLR to estimate the number of clusters and to cluster the cells. SIMLR performs very well, but it is also very memory-hungry on large datasets (over 5000 cells). For your dataset and on your system, you can try setting the n.cluster argument to an arbitrary value in order to skip the "Estimating the number of clusters" step, and setting the method argument to "kmeans". This should work and produce a 2D t-SNE map on which you can visualize your data and estimate the number of clusters yourself. Then re-run the analysis with n.cluster set to the number you estimated. It is a bit tedious but it should work :).

Thanks again for using SingleCellSignalR!

SCA
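
A rough back-of-the-envelope calculation shows why memory grows quickly with the number of cells. It assumes the method works on dense cell-by-cell matrices of doubles; that is an assumption about the internals, not something stated in this thread. The cell count is the number of columns reported by str(ND_merged) above.

```r
# Number of cells (columns of ND_merged, as reported earlier in the thread).
n_cells <- 11636

# Assumption: one dense n_cells x n_cells matrix of 8-byte doubles.
bytes_per_double <- 8
one_matrix_gib <- n_cells^2 * bytes_per_double / 1024^3

# Size of a single such matrix in GiB, rounded for display.
round(one_matrix_gib, 2)
```

Under that assumption a single matrix is already about 1 GiB, and any method that keeps several of them in memory at once, possibly in several parallel workers, multiplies that figure.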

FerdinandoPucci commented 3 years ago

Thank you so much, it worked. I estimated the number of clusters with Loupe browser.

> ND_clusters <- clustering(ND_merged, n.cluster=15, method="kmeans")
15 clusters detected
cluster 1 -> 220 cells
cluster 2 -> 398 cells
cluster 3 -> 220 cells
cluster 4 -> 43 cells
cluster 5 -> 100 cells
cluster 6 -> 1914 cells
cluster 7 -> 2694 cells
cluster 8 -> 2 cells
cluster 9 -> 2586 cells
cluster 10 -> 188 cells
cluster 11 -> 2419 cells
cluster 12 -> 239 cells
cluster 13 -> 9 cells
cluster 14 -> 521 cells
cluster 15 -> 83 cells
Warning message:
Quick-TRANSfer stage steps exceeded maximum (= 581800)

However, the cell_signaling() function does not find any cellular interaction. I would think this is quite unlikely as all these cells come from the same organ (lymph node). I am wondering if the merging with bulk seq data is what causes that.

I have 2 "wet lab" clusters (purified cell populations sequenced in bulk) that I need to add to the scRNA data. This is because it is known that 10x and other scRNA seq procedures miss the more "delicate" cell types (they get destroyed in the GEMM phase) such as macrophages/dendritic cells, senescent cells, ... But bulk RNA seq values are much higher than scRNA seq:

> range(Bulk.ND)
[1] 0 2108283
> range(ND_matrix_norm) #scRNA dataset
[1] 0.00000 9.98179

Maybe I should use the Zscore or similar?

Thank you

SCA-IRCM commented 3 years ago

Hi,

I tend to think that merging bulk and single-cell RNA-seq data is not a good idea; you should analyze the two datasets separately (see https://www.thno.org/v10p4383 for a method using bulk RNA-seq). However, it is probably not the reason why you don't see any interactions.

By default, the cell_signaling() function computes only the "pure" paracrine interactions, meaning that the corresponding receptor is not expressed by the cell that expresses the ligand (see the supplementary figures in the paper for more details). Finding no such interactions usually happens when the cell types are close, which seems to be your case. You can check this by setting the int.type argument to "autocrine"; it should return a lot of interactions.

If you're interested (as I think you are) in the communication between the different cell types, you can try adjusting the tol argument (see details). For example, setting tol=0.05 allows 5% of the cells expressing the ligand to also express the receptor. Keep me posted if it solves your problem.

Hope this helps.

SCA
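
To make the tol description above concrete, here is a toy illustration in base R of "the fraction of ligand-expressing cells that also express the receptor". The vectors ligand and receptor and the variable names are hypothetical, and this only mirrors the verbal description in the comment; it is not the package's actual implementation.

```r
# Toy data: one ligand and one receptor measured in 20 cells of a cluster.
ligand   <- c(rep(1.5, 10), rep(0, 10))  # 10 cells express the ligand
receptor <- c(2.0, rep(0, 19))           # 1 of those cells also expresses the receptor

# Fraction of ligand-expressing cells that also express the receptor.
ligand_cells <- ligand > 0
frac_both <- sum(ligand_cells & receptor > 0) / sum(ligand_cells)

# With tol = 0.05, at most 5% co-expressing cells would be tolerated.
tol <- 0.05
within_tol <- frac_both <= tol
```

In this toy case 10% of the ligand-expressing cells co-express the receptor, so the pair would not pass at tol=0.05 but would at tol=0.1, which is the kind of effect to expect when raising tol.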

FerdinandoPucci commented 3 years ago

It worked! Thank you! Very few interactions with tol=0.05; I will try increasing it.

However, it does not consider the 2 extra clusters (bulk RNA seq data of purified cell subsets) I manually added to the list generated by clustering().

> ND_clusters_copy$numbers
[1] 998 294 355 821 1081 935 781 933 10 824 1649 303 443 307 1900 1 1

It does look for DGE for clusters 16 and 17:

...
No such file as table_dge_cluster 17.txt in the cluster-analysis folder
...

I guess not finding the dge table is normal, as it also happened in the vignette on Bioconductor. But:

...
0 No significant interaction found from cluster 1 to cluster 15
0 No significant interaction found from cluster 2 to cluster 1
...

A bit more info:

> ND_signal <- cell_signaling(data = ND_merged, genes = rownames(ND_merged), cluster = ND_clusters_copy$cluster, write = FALSE)
> nrow(ND_merged)
[1] 21532
> length(rownames(ND_merged))
[1] 21532
> ncol(ND_merged)
[1] 11636
> length(ND_clusters_copy$cluster)
[1] 11636
> nrow(ND_clusters_copy$'t-SNE')
[1] 11636
> length(ND_clusters_copy$numbers)
[1] 17
> ND_clusters_copy$cluster[(length(ND_clusters_copy$cluster)-5):length(ND_clusters_copy$cluster)]
V11924 V11925 V11926 V11927    SSM    MSM
     4     15      7     11     16     17

I hope I edited the list generated by clustering() in the right way. The number of columns of data (11636) matches the length of the cluster vector, and the number of rows matches the number of rownames, as specified in the documentation. Maybe I do not have official HUGO gene symbols?

Thanks for helping!

FerdinandoPucci commented 3 years ago

Increasing tol still does not detect enough interactions, any suggestion? Thanks

SCA-IRCM commented 3 years ago

Hi,

You can check whether your gene names match the gene names in LRdb, which is accessible once the package is loaded (just type LRdb). You should also try setting int.type = "autocrine" and see if you get a lot of interactions. Depending on this I can give you 2 explanations:

Keep me posted on the result you get with int.type = "autocrine".

Tcheers,

SCA
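
The gene-name check suggested above could look something like the following sketch. It uses a small hypothetical stand-in (lrdb_toy) so that it runs without the package; with SingleCellSignalR loaded one would use the real LRdb object instead. The column names ligand and receptor are an assumption about the layout of LRdb, and my_genes stands in for rownames(ND_merged).

```r
# Hypothetical stand-in for LRdb; with the package loaded, use the real LRdb.
lrdb_toy <- data.frame(
  ligand   = c("CCL19", "CXCL12", "IL7"),
  receptor = c("CCR7", "CXCR4", "IL7R"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for rownames(ND_merged).
my_genes <- c("Ccl19", "Ccr7", "0610005C13Rik", "CXCR4")

# All gene symbols that appear as a ligand or a receptor.
lr_genes <- unique(c(lrdb_toy$ligand, lrdb_toy$receptor))

# Exact matches versus matches after upper-casing the symbols.
exact_hits <- sum(my_genes %in% lr_genes)
upper_hits <- sum(toupper(my_genes) %in% lr_genes)
```

If the exact count is much lower than the upper-cased count, the symbols probably follow a different naming convention than the database. Upper-casing is only a quick diagnostic here and is not a substitute for a proper symbol or ortholog mapping.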