broadinstitute / infercnv

Inferring CNV from Single-Cell RNA-Seq

broken `hclust_method`: cells of similar profiles not clustering together #463

Open tbrunetti opened 2 years ago

tbrunetti commented 2 years ago

Hello,

I have been using the newest pull from the master branch, and everything works well except that my cells no longer cluster together along the y-axis. I have tried setting cluster_by_groups = T and cluster_by_groups = F, and both return the same result. I only have a single sample. An older version of your software clusters this properly using the same command:

infercnv_obj = CreateInfercnvObject(raw_counts_matrix = sample_counts_matrix,
                                    annotations_file = cell_idents,
                                    delim = "\t",
                                    gene_order_file = gene_df,
                                    ref_group_names = NULL)

infercnv_obj = infercnv::run(infercnv_obj,
                             cutoff = 0.1,  # use 1 for smart-seq, 0.1 for 10x-genomics
                             out_dir = outdir_null,  # dir is auto-created for storing outputs
                             cluster_by_groups = TRUE,  # cluster cells within each annotation group
                             denoise = TRUE,
                             analysis_mode = "subclusters",
                             num_threads = 6,
                             output_format = "pdf",
                             HMM = TRUE,
                             resume_mode = FALSE)

Any suggestions of what to try? Thanks!

UPDATE: I have been experimenting with the hclust_method= parameter, and I think it may be broken in the current master branch. So far I have tried setting it to ward.D, ward.D2, single, and complete, and none of them changes the clustering. Additionally, the default of ward.D2 in an older version of this software does cluster this data set as I expect.

GeorgescuC commented 2 years ago

Hi @tbrunetti ,

Based on the settings you are using and recent changes, I think the issue has to do with the subclustering splitting your cells too much (maybe down to each cell being a subcluster). Once subclusters are defined, an hclust is calculated for each subcluster independently (this is where hclust_method matters), and combined with the others. So if each cell is a subcluster, they will just get appended one after the other to the merged tree based on the iteration order.
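The behaviour described above can be illustrated with base R alone. This is a minimal sketch, not inferCNV's actual code: `order_cells` is a hypothetical helper that runs `hclust` inside each subcluster and appends the resulting cell orders in iteration order, and the expression values are made up. When every cell is its own subcluster there is nothing for `hclust` to arrange, so every `hclust_method` yields the same order.

```r
# Toy expression matrix: 6 cells (rows) x 4 genes (columns). Values are made up.
set.seed(1)
expr <- matrix(rnorm(24), nrow = 6,
               dimnames = list(paste0("cell", 1:6), paste0("gene", 1:4)))

# Hypothetical helper (not part of inferCNV): cluster each subcluster on its
# own, then append the per-subcluster cell orders in iteration order.
order_cells <- function(expr, subclusters, hclust_method) {
  ordered <- lapply(subclusters, function(cells) {
    if (length(cells) < 2) {
      return(cells)  # a single cell has nothing to cluster
    }
    hc <- hclust(dist(expr[cells, , drop = FALSE]), method = hclust_method)
    cells[hc$order]
  })
  unlist(ordered, use.names = FALSE)
}

# Over-split case: every cell is its own subcluster.
singletons <- as.list(rownames(expr))
methods <- c("ward.D", "ward.D2", "single", "complete")
singleton_orders <- lapply(methods, function(m) order_cells(expr, singletons, m))

# Check whether every method returned the input order unchanged.
all_identical <- all(vapply(singleton_orders, identical, logical(1),
                            rownames(expr)))
print(all_identical)

# One subcluster holding all cells: hclust now decides the order.
one_group <- list(rownames(expr))
print(order_cells(expr, one_group, "ward.D2"))
```

In the over-split case the final order is just the iteration order of the subclusters, which matches the symptom of `hclust_method` appearing to have no effect.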

How many cells do you have, and what does the preliminary plot look like? You can check the infercnv_obj@tumor_subclusters$subclusters slot for the list of subclusters and their content. If that is indeed the issue, I would try lowering the leiden_resolution parameter. Too much fragmentation in subclusters is also bad for the HMM, so improving that would benefit predictions too.
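A quick way to check for over-splitting is to count the cells in each subcluster. The sketch below assumes the slot is a nested list of the form group -> subcluster -> cell indices; that layout, the mock data, and the helper `subcluster_sizes` are assumptions for illustration, so adjust them if your object looks different.

```r
# Mock of the nested list assumed to live in
# infercnv_obj@tumor_subclusters$subclusters: group -> subcluster -> cell indices.
# The group and subcluster names here are invented for illustration.
mock_subclusters <- list(
  tumor  = list(tumor_s1 = 1:3, tumor_s2 = 4L, tumor_s3 = 5L),
  normal = list(normal_s1 = 6:9)
)

# Hypothetical helper: one row per subcluster with its cell count.
subcluster_sizes <- function(subclusters) {
  rows <- lapply(names(subclusters), function(group) {
    data.frame(group = group,
               subcluster = names(subclusters[[group]]),
               n_cells = unname(lengths(subclusters[[group]])),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

sizes <- subcluster_sizes(mock_subclusters)
print(sizes)

# Fraction of subclusters that contain a single cell; a value near 1 would
# match the over-splitting described above.
singleton_fraction <- mean(sizes$n_cells == 1)
print(singleton_fraction)
```

With a real run, pass `infercnv_obj@tumor_subclusters$subclusters` in place of `mock_subclusters`.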

Regards, Christophe.

tlebchan commented 2 years ago

Hi @tbrunetti ,

I have the same problem, and I have no idea how to fix it. I tried tuning leiden_resolution and tumor_subcluster_pval, but the result is exactly the same. Examining infercnv_obj@tumor_subclusters$subclusters showed that the algorithm treats each cell as a separate cluster. @GeorgescuC, do you know anything about this problem?

Thank you, Gleb

tbrunetti commented 2 years ago

The preliminary data looks the same as the final data, and this happens for every input I use. I have tried as many as 8000 cells and other times as few as 1000 cells, each representing a different sample, and not once has it clustered the way infercnv used to. I ended up using an old version of infercnv just to get the hclust to work, because it does work on a previous version, just not on the current master branch.

@tlebchan Yeah, I think we both have the same problem. I tried changing leiden_resolution too, and it had no effect on the hclust.

GeorgescuC commented 1 year ago

Hi @tbrunetti @tlebchan ,

Based on @tlebchan's examination of the infercnv_obj@tumor_subclusters$subclusters data, the issue does appear to be related to over-splitting of subclusters down to 1 cell per cluster.

In the recent version, we changed how the Leiden algorithm is called: it now uses the R implementation in "igraph", because the "leidenalg" implementation that called into Python started to produce errors and was no longer supported. With this, the default scoring metric used in the algorithm changed from "modularity" to "CPM" (which is theoretically an improvement). One of the differences between the two is that the value of the "resolution" threshold required to obtain a certain level of splitting changed. A resolution of 1 with "CPM" can roughly be replaced by 0.05-0.1 when using "modularity", but the new default setting of 0.05 might not work well for your datasets. Besides that, by default we now also run a PCA to define the shared nearest neighbor graph on which the Leiden algorithm is run. If you wish to use the older method without PCA, you can set leiden_method="simple". One of the results of these changes is a tendency to generate a number of very small clusters, which usually contain the noisiest cells.
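To see why a resolution value does not carry over between the two metrics, here is a small base R sketch. It is not inferCNV or igraph code and does not run Leiden: it scores three hand-picked partitions of an invented toy graph under one common form of each quality function and reports which partition scores best at each resolution. The graph, the partitions, and all function names are assumptions made for illustration.

```r
# Toy graph: two 4-cliques (nodes 1-4 and 5-8) joined by the edge 4-5.
clique_edges <- function(nodes) t(combn(nodes, 2))
edges <- rbind(clique_edges(1:4), clique_edges(5:8), c(4, 5))
n_nodes <- 8
n_edges <- nrow(edges)
degree <- tabulate(as.vector(edges), nbins = n_nodes)

# Three candidate partitions, from no splitting to one node per cluster.
partitions <- list(
  one_cluster = rep(1, n_nodes),
  two_cliques = rep(1:2, each = 4),
  singletons  = seq_len(n_nodes)
)

# Edges with both endpoints in the same cluster, counted per cluster.
internal_edges <- function(membership) {
  same <- membership[edges[, 1]] == membership[edges[, 2]]
  tabulate(membership[edges[same, 1]], nbins = max(membership))
}

# CPM quality (one common form): sum over clusters of
# e_c - gamma * choose(n_c, 2).
cpm_quality <- function(membership, gamma) {
  sizes <- tabulate(membership, nbins = max(membership))
  sum(internal_edges(membership) - gamma * choose(sizes, 2))
}

# Modularity with a resolution parameter: sum over clusters of
# e_c / m - gamma * (K_c / (2m))^2, where K_c is the summed degree.
modularity_quality <- function(membership, gamma) {
  cluster_degree <- as.vector(tapply(degree, membership, sum))
  sum(internal_edges(membership) / n_edges -
        gamma * (cluster_degree / (2 * n_edges))^2)
}

# Name of the candidate partition with the highest score.
best_partition <- function(quality, gamma) {
  scores <- vapply(partitions, quality, numeric(1), gamma = gamma)
  names(which.max(scores))
}

resolutions <- c(0.05, 0.5, 1.5)
comparison <- data.frame(
  resolution = resolutions,
  CPM = vapply(resolutions,
               function(g) best_partition(cpm_quality, g), character(1)),
  modularity = vapply(resolutions,
                      function(g) best_partition(modularity_quality, g),
                      character(1)),
  stringsAsFactors = FALSE
)
print(comparison)
```

On this toy graph the two metrics agree at the lower resolutions but diverge at 1.5, where CPM prefers one node per cluster and modularity still prefers the two cliques. The exact crossover points depend on the graph, so this only shows that the scales differ, not what value suits a given dataset.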

tumor_subcluster_pval only affects subclustering when using the random trees method.

If updating to the latest version of the code does not fix this, could you share a small example dataset and the options you used to reproduce the issue so I can debug it?

Regards, Christophe.