broadinstitute / infercnv

Inferring CNV from Single-Cell RNA-Seq

subcluster issues #528

Open callaL opened 1 year ago

callaL commented 1 year ago

Thank you for developing such an excellent tool.

I have some questions about subclustering. I ran infercnv with the following settings: cluster_by_groups=F, scale_data=F, denoise=TRUE, HMM=TRUE, analysis_mode="subclusters", tumor_subcluster_partition_method='leiden', num_threads = 10, leiden_resolution = 0.5, output_format='pdf'. The final plot looks like this:

1. Is this picture drawn according to the subclusters? The grouping in this plot looks the same as in infercnv.20_HMM_predHMMi6.leiden.hmm_mode-subclusters.Pnorm_0.5.repr_intensities.pdf.

2. How can the inferCNV subcluster results be displayed in infercnv.pdf? For example, by adding an annotation bar on the left side of the plot?

GeorgescuC commented 1 year ago

Hi @callaL ,

Both pictures looking very similar is a good thing! The residual expression is the basic end output of infercnv, where each cell is processed independently. The HMM segments CNVs into more precise regions by taking advantage of the clonal expansions identified through subclustering, which makes it possible to define start and end boundaries despite the noise at the single-cell level. The HMM also assigns a fold-change level to each CNV, which makes it easy to distinguish between single and double gain/loss (or even higher gain). The limitation to keep in mind is that the HMM looks at a group of cells all together, so you only get one set of results for all those cells. That is why, in most cases, we need to run the subclustering to split cells into the different clones. So in this case, where both results look similar, it indicates that the subclustering worked well and properly separated the different clones (2 in this case), so you can have confidence in the HMM results.

Displaying the subclusters on the plot is actually a feature that was added in the very latest version of infercnv, 1.15.3, which is currently only on Bioc-devel. The new releases of R and Bioconductor are due close to the end of the month, so it will be available on the stable branch then. In the meantime, you can update infercnv directly from GitHub to access the feature using:
```
library("devtools")
devtools::install_github("broadinstitute/infercnv")
```
With this version, the leftmost color bar on the observation plot indicates the subclusters, and there is also a helper method to plot the figure with subclusters as annotations, which adds a black bar between each of them for more visual clarity. One note though: based on the Leiden resolution setting you are using, you seem to be using a version from before the Leiden library change, so after updating you may need to either lower leiden_resolution or change leiden_function back to "modularity". With the new subcluster visualization, though, you should easily be able to tell whether the subclusters are good or not.

Regards, Christophe.

callaL commented 1 year ago

Thank you for your reply. The similarity between the two images indicates that the subclustering worked well, but it doesn't mean that the first image was drawn according to the subclusters, does it? I have another question: why does the HMM predict using the 'preliminary infercnv object' instead of the denoised result?

GeorgescuC commented 1 year ago

Hi @callaL ,

Unless you manually edited the internals of the infercnv object, each subcluster should always be a monophyletic group in the dendrogram, so all cells from the same subcluster should be contiguous. This follows from how the dendrogram is built: either individual hclust objects are calculated for each subcluster and then merged as dendrograms into an overarching one, or an initial hclust is cut into subclusters. If the subclusters were not contiguous, it would in all likelihood also be obvious on the HMM figure, because all cells from a subcluster have the same values, so you would see the different patterns mixed up.

The HMM uses the non-denoised results because the distribution of values is compared to that of the simulated data, and denoising changes what that distribution looks like, as it zeroes all values within a certain range of the center.
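
To illustrate the point about distributions, here is a minimal base R sketch. It is not infercnv's actual denoising code: the simulated values, the center of 0 and the `half_width` noise band are all assumptions chosen for illustration. It only shows that collapsing values near the center turns a smooth bell shape into a spike plus two tails, which would no longer be comparable to a simulated distribution.

```r
set.seed(1)

# Simulated residual expression values (assumption: roughly normal around 0)
values <- rnorm(1000, mean = 0, sd = 0.1)

center <- 0          # assumed center of the distribution
half_width <- 0.15   # hypothetical width of the noise band

# Collapse every value inside the noise band onto the center
denoised <- ifelse(abs(values - center) < half_width, center, values)

# Fraction of values sitting exactly on the center, before and after
frac_at_center_before <- mean(values == center)
frac_at_center_after <- mean(denoised == center)

# Values that survive denoising all lie outside the noise band
survivors <- denoised[denoised != center]
```

Comparing `hist(values)` with `hist(denoised)` makes the change in shape visible.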

Regards, Christophe.

Nisanity007 commented 1 year ago

Dear Christophe, this is an excellent package for cancer research. I have a question about leiden_resolution: how should I choose a proper value, or is it just set empirically? Best wishes, QiHuang

GeorgescuC commented 1 year ago

Hi @Nisanity007 ,

At this time there is no automated method to select the best value, but the latest version, 1.16.0, has features to help you inspect the subclustering results and determine whether they are good or the resolution needs adjustment. If you run infercnv with the option up_to_step=15, the run will stop after step 15, which is the subclustering. You can then check the preliminary plot and the subclustering plot, which are now generated by default and display subclustering information. On the preliminary plot, the leftmost color bar indicates the subclusters, and on the subcluster plot, black bars separate the subclusters instead of the annotation groups. If the subclustering looks good, you can remove the up_to_step setting and finish the run. If the subclustering needs adjustment, simply change the leiden_resolution parameter and rerun infercnv with the up_to_step option; only the subclustering step will be rerun. I have a tutorial that is ready and will soon be made available to explain this more easily.
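
As a rough sketch of that workflow, the call could be wrapped as below. The helper name `run_until_subclustering` and the `cutoff` value are assumptions for illustration, and `infercnv_obj` is assumed to have been created beforehand (for example with infercnv::CreateInfercnvObject()). Reusing the same `out_dir` between attempts is what should let infercnv pick up the earlier steps instead of recomputing them.

```r
# Hypothetical helper (not part of infercnv): run only up to the
# subclustering step so the subcluster plots can be inspected first.
run_until_subclustering <- function(infercnv_obj, out_dir, resolution) {
  infercnv::run(
    infercnv_obj,
    cutoff = 0.1,                  # assumed value for 10x data
    out_dir = out_dir,             # keep the same directory between attempts
    analysis_mode = "subclusters",
    tumor_subcluster_partition_method = "leiden",
    leiden_resolution = resolution,
    cluster_by_groups = FALSE,
    denoise = TRUE,
    HMM = TRUE,
    up_to_step = 15                # stop after the subclustering step
  )
}
```

Once the plots look right, the same call without `up_to_step` finishes the run.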

Regards, Christophe.

Nisanity007 commented 1 year ago

Thank you for your great advice. I tried it and set leiden_resolution to 0.00001 (fig1) and 0.000005 (fig2); the plots are shown below. Does leiden_resolution need to be adjusted further, for example to 0.000001, 0.0000001 or lower? I also tried running infercnv with the option tumor_subcluster_partition_method = "random_trees", which seems better (fig3); however, the subcluster plot looks a bit confusing (fig4).

[Figures: fig1 (leiden_resolution = 0.00001), fig2 (leiden_resolution = 0.000005), fig3, fig4 (infercnv_subclusters)]

GeorgescuC commented 1 year ago

Hi @Nisanity007 ,

If you are using version 1.16.0+ of infercnv, then based on fig1/2, the Leiden resolution you used is now too low and all cells are kept together in the same cluster. Based on the datasets I ran tests with, a good starting range for the Leiden resolution is 0.05 to 0.01, although the value varies with dataset size, diversity and quality. For an example, there is now a vignette and a video tutorial that I just released, which you can find linked on the wiki. If additional refinement of the subclusters seems needed but increasing the Leiden resolution further leads to too much fragmentation, then given how your data looks, you can try increasing the Leiden resolution while also enabling the per_chr_hmm_subclusters option. This means you will have more, smaller subclusters, but the per-chromosome subclustering allows the HMM to run on a per-chromosome basis with larger subclusters specific to each chromosome, and then gives a consensus for each overall subcluster.
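
A possible way to express those settings is sketched below. The list name `subcluster_args` and the specific resolution are illustrative assumptions, not recommended values, and `infercnv_obj` and `out_dir` are assumed to exist already.

```r
# Hypothetical settings to pass to infercnv::run() alongside the usual arguments
subcluster_args <- list(
  analysis_mode = "subclusters",
  tumor_subcluster_partition_method = "leiden",
  leiden_resolution = 0.05,        # upper end of the suggested starting range
  per_chr_hmm_subclusters = TRUE   # let the HMM use per-chromosome subclusters
)

# With an existing infercnv_obj and out_dir, the call could look like:
# infercnv_obj <- do.call(
#   infercnv::run,
#   c(list(infercnv_obj, out_dir = out_dir), subcluster_args)
# )
```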

For figure 3, the clustering itself looks good, however the potential problem with the random trees method (besides run time) is whether enough of the branches have been split.

Figure 4 looks weird because it is actually a plot generated at step 7, which is when the subclustering happens with the random_trees method and is before some of the key steps that make the figure readable. I should probably delay this plotting when using the random trees method to right before the HMM, just as for the Leiden method. An easy way to get the better version of this plot without rerunning the analysis is as follows:

```
library(infercnv)

# Directory of the finished random_trees run (keep the trailing slash)
out_dir = "output/path/for/the/random/trees/run/"
# Load the preliminary object saved by infercnv and replot it with subclusters
prelim_obj = readRDS(paste0(out_dir, "preliminary.infercnv_obj"))
plot_subclusters(prelim_obj, out_dir=out_dir, output_filename="preliminary_obj_subclusters")
```

This should generate a file "preliminary_obj_subclusters.png" where the residual expression is plotted with the random trees subclusters as annotations.

Regards, Christophe.

Nisanity007 commented 1 year ago

Thank you very much for your constructive suggestion! I will try it again!

andynkili commented 5 months ago

@GeorgescuC ,

I have the same issue: infercnv_subclusters.png looks weird for different samples.

[Image: infercnv_subclusters]

I ran:

```
out_dir = "output/path/for/the/random/trees/run/"
prelim_obj = readRDS(paste0(out_dir, "preliminary.infercnv_obj"))
plot_subclusters(prelim_obj, out_dir=out_dir, output_filename="preliminary_obj_subclusters")
```

and obtained preliminary_obj_subclusters.png.

The issue is that the cell grouping is always the same for all samples:

[Image: random_tree]

Here is how I ran infercnv (1.19.1):

```
infercnv_obj = infercnv::run(infercnv_obj,
                             cutoff = 0.1,  # use 1 for smart-seq, 0.1 for 10x-genomics
                             out_dir = paste(paste(unique(x$orig.ident), collapse = "_"), "_RefNormal_win101", sep = '_'),
                             num_threads = 15, window_length = 101,
                             HMM_type = 'i6', BayesMaxPNormal = 0.5,
                             no_plot = FALSE, plot_steps = TRUE, plot_probabilities = FALSE,
                             save_rds = TRUE, save_final_rds = TRUE, no_prelim_plot = FALSE,
                             cluster_by_groups = FALSE,   # cluster
                             tumor_subcluster_partition_method = 'random_trees',
                             denoise = TRUE, cluster_references = TRUE,
                             analysis_mode = "subclusters",
                             HMM = TRUE)
```

I need the correct cell groupings to plot a CNV phylogenetic tree using UPhyloplot2.

Best, Andy