broadinstitute / infercnv

Inferring CNV from Single-Cell RNA-Seq
Other
557 stars 164 forks source link

Some confusion about cluster_by_groups #509

Open Rui-Jing opened 1 year ago

Rui-Jing commented 1 year ago

Thank you and your team for inventing such a user-friendly tool. I encountered some issues while using infercnv, which left me very confused.

Firstly, in the CreateInfercnvObject function, if a filtered counts matrix from Seurat is used as input instead of the raw matrix, will some information be lost?

Secondly, I ran infercnv twice on the same dataset separately, and noticed some inconsistencies in the results. The code for the two runs is provided below, with the only difference being the cluster_by_groups parameter changing from True to False. My single-cell data comes from 10X single-cell sequencing of a single sample of brain tumor. In this case where there is no multiple patient source, can I set the cluster_by_groups parameter to True? Will the cluster_by_groups parameter lead to the appearance of some false positive signals, and which result should I believe in, given the inconsistency between True and False?"

library(Seurat)
library(infercnv)

#infercnv for P60 sample construction
P60 <-readRDS(file = "/data/P60/outs/P60_final_annotated.rds")

#Extracting counts from P60
counts <- GetAssayData(object = P60, slot = "counts")

#Creat infercnv object
infercnv_obj = CreateInfercnvObject(raw_counts_matrix=counts,
                    annotations_file="/home/xpxp/P60_annotation_new.txt",
                    delim="\t",
                    gene_order_file="/home/xpxp/mm_geneOrderingFile.txt",
                    ref_group_names=c("Microglia",'Macrophages'))

## ** First infercnv:: information**
#Run infercnv
infercnv_obj = infercnv::run(infercnv_obj,
                             cutoff=0.1,
                             out_dir="/data/P60_Infercnv/",
                             **cluster_by_groups=True,**
                             denoise=T,
                             analysis_mode='subclusters',
                             HMM=TRUE,
                             HMM_type='i6',
                 tumor_subcluster_partition_method = "random_trees",
                             num_threads=15)

## ** Second infercnv:: information**
infercnv_obj = infercnv::run(infercnv_obj,
                             cutoff=0.1,
                             out_dir="/data/P60_Infercnv_none_by_groups",
                             **cluster_by_groups=False,**
                             denoise=T,
                             analysis_mode='subclusters',
                             HMM=TRUE,
                             HMM_type='i6',
                 tumor_subcluster_partition_method = "random_trees",
                             num_threads=15)

The inconsistent results are below : 1676863245840 1676863527314

The third question is, if I want to infer the relationship between cell subpopulations using the CNV method, what clustering method do you recommend?

GeorgescuC commented 1 year ago

Hi @Rui-Jing ,

1) For the question about Seurat filtering, it should not be an issue. There will be fewer cells/genes in the matrix you provide, but most of those would be filtered by infercnv as well if you used the appropriate settings as they should be bad quality cells and non expressed genes. The one thing I would change when running infercnv if you already filtered your matrix in Seurat is to set the cutoff argument in CreateInfercnvObject to 0 so no additional filtering is done as it should not be needed.

2) The cluster_by_groups option makes it so that the annotation groups you provide as input are each analyzed separately, so subclusters will never be a mix of 2 annotations. If you set the option to false, all observations are first pooled together and then clustered. Regarding the difference you see in results there are 2 things. First, when using the random_trees method for subclustering, there is a tendency to always generate 4 subclusters. If in the case where cluster_by_groups is False, you have more than 4 clonal populations, one subcluster at least will be a mix of different clones, so there will be no correct prediction possible for all cells within that cluster, as at least one clone won't match the prediction. There is however a much bigger issue if you used the exact code you shared. When you first execute the run() method with cluster_by_groups set to True, the contents of the object that is returned are different than after freshly creating the object. The most important of which is the expression data stored in the object that is modified throughout the analysis. When you execute the second run() call, the starting expression matrix is the end result of the first run, so it is not a raw counts matrix anymore and processing is done again on a matrix with different properties. To get around that, you can either store the return of run() in a different object, or recreate the object from scratch in between both runs.

In the case where all data is from a single patient, if your annotations are cell types for example, it would still be worth keeping the cluster_by_groups option to True as it is relevant information to distinguish them.

Overall, an important thing to do, especially in cases where you are unsure about the results, but should still always aim to do, is compare the HMM figures with the residual expression figures and see if the results make sense in comparison to each other. If you don't see any signal in the residual expression figure but have HMM predictions, I would be concerned about false positives, which could happen if subclusters are too small and noise is interpreted as signal. If the residual expression figure has clear signal but the HMM predictions do not identify anything, you might be in the opposite case where clusters are too big and combine different clones leading the HMM to see conflicting signals.

3) As I mentioned in the previous point, the random_trees method has a tendency to always subcluster into 4 clusters, so in most cases it is not ideal. The default method, using Leiden clustering, is significantly faster and does not have this drawback. It however is best to inspect subclusters to tweak the resolution argument as needed depending on the diversity and size of your dataset. Because subclustering with the Leiden algorithm does not give a direct relationship between the subclusters, you can run an hclust() on one cell from each subcluster with the HMM results or on the average residual expression profiles of subclusters to have a hierarchy.

Regards, Christophe.

Rui-Jing commented 1 year ago

I'm sorry to bother you again. Regarding the sentence you mentioned earlier(Because subclustering with the Leiden algorithm does not give a direct relationship between the subclusters, you can run an hclust() **on one cell from each subcluster** with the HMM results or **on the average residual expression** profiles of subclusters to have a hierarchy.), could you give me some examples? I'm not sure which file to use for clustering.

Thanks again.

GeorgescuC commented 1 year ago

Hi @Rui-Jing ,

Here is a basic version of both:

hmm_obj = readRDS("20_HMM_pred.repr_intensitiesHMMi6.leiden.hmm_mode-subclusters.Pnorm_0.infercnv_obj")

sampled_indices = c()
sampled_names = c()
for (grp in c(names(hmm_obj@observation_grouped_cell_indices), "all_observations")) {
    for (grp2 in names(hmm_obj@tumor_subclusters$subclusters[[grp]])) {
        sampled_indices = c(sampled_indices, hmm_obj@tumor_subclusters$subclusters[[grp]][[grp2]][1])
        sampled_names = c(sampled_names, grp2)
    }
}

to_cluster = hmm_obj@expr.data[, sampled_indices, drop=FALSE]
colnames(to_cluster) = sampled_names
d = parallelDist::parallelDist(t(to_cluster))
hc = hclust(d)
plot(hc, main="HMM based hclust")
residual_obj = readRDS("run.final.infercnv_obj")

med_mat = NULL
sampled_names = c()
for (grp in c(names(residual_obj@observation_grouped_cell_indices), "all_observations")) {
    for (grp2 in names(residual_obj@tumor_subclusters$subclusters[[grp]])) {
        if(is.null(med_mat)) {
            med_mat = apply(residual_obj@expr.data[, residual_obj@tumor_subclusters$subclusters[[grp]][[grp2]], drop=FALSE], 1, median)
            sampled_names = c(sampled_names, grp2)
        }
        else {
            med_mat = cbind(med_mat, apply(residual_obj@expr.data[, residual_obj@tumor_subclusters$subclusters[[grp]][[grp2]], drop=FALSE], 1, median))
            sampled_names = c(sampled_names, grp2)
        }
    }
}

colnames(med_mat) = sampled_names
d = parallelDist::parallelDist(t(med_mat))
hc = hclust(d)
plot(hc, main="residual expression based hclust")

Regards, Christophe.

tingxie2020 commented 1 year ago

Hi @Rui-Jing Could you please let me know how you plot the umap figures in your Feb.19 post? @GeorgescuC could you please let me know if you knows? Thank you!

GeorgescuC commented 1 year ago

Hi @tingxie2020 ,

You can find how to plot this type of figure with infercnv results using Seurat on the wiki.

Regards, Christophe.