Extraction of subclusters from inferCNV output

broadinstitute / infercnv

Inferring CNV from Single-Cell RNA-Seq

Other

557 stars 164 forks source link

Dear all,

I have run inferCNV individually on multiple tumor samples (~1 to 15K cells). As reference, I used a mixture of healthy plasma cells from different samples. I have run each using the following script:

infercnv_obj = CreateInfercnvObject(raw_counts_matrix = seurat_oj[["RNA"]]@counts, annotations_file="annotation_file.tab", delim="\t", gene_order_file="gene_order_file_b.tab", ref_group_names=c("C2", "C3", "C4"))

infercnv_obj = infercnv::run(infercnv_obj, cutoff=0.1, out_dir="output", num_ref_groups=10, denoise=TRUE, HMM=TRUE, analysis_mode = "subclusters")

I am sending the final plot and the 19.HMM prediction plot of a representative case. In this, C1 represents the healthy population in the sample and 57 is the malignant population.

I would like to evaluate intra-tumor heterogeneity based on the inferred CNVs detected in different subclusters detected in each patient sample. However, I am a little confused about the output that I need to used to define the number/identity of the subclusters and the cell IDs for each one. I found a few post post about this, but the information is not very clear to me. Two ideas came up fromt he reading and undertanding the inferCNV output:

The subclusters information and cell IDs can be found at infercnv_obj@tumor_subclusters$subclusters slot. However, my doubt is about the number of subclusters. Most of the samples presented 8 subclusters, including samples from asymptomatic and symptomatic patients. It does not sound right. For the biology perspective, I was expecting to see more in the asymptomatic and la predominant one in the symptomatic samples.
The second way (possibly) is using the hierarchical clustering information. I did not set up a number of k number in the code because it is uncertain the number of subclusters I am expecting per sample.

If you could help me to define the best way to identify the number and the identity of the subclusters in each sample, it would be extremely helpful.

Many thanks in advance for your help!

Regards,

Yurany

Hi @Yurany2908 ,

Which version of infercnv are you running? The subclustering has gone through iterations and some changes to the default settings, so the quality can vary depending on the version you are using. If you look at the file names in the output folder, you should have the subclustering method used in the names of some of the files, which should be either random_trees or leiden. In particular, the newest version available through Github and with the next Bioconductor release (scheduled for next week) has improved Leiden clustering, with more options available for tuning it. One of the difficulties with subclustering is that the optimal settings can vary based on the size of the dataset and its diversity, so some iteration with different settings can be useful (for that, set the up_to_step=15 option so you do not run all the steps past the subclustering until you are satisfied with its quality). To clearly visualize the subclustering done at step 15 without having to continue the run, you can run this snippet:

infercnv_obj_step15 = readRDS("15_tumor_subclustersHMMi6.leiden.infercnv_obj")
subcluster_obj = infercnv_obj_step15

subcluster_obj@reference_grouped_cell_indices = list()
for (grp in names(infercnv_obj_step15@reference_grouped_cell_indices)) {
    for (grp2 in names(infercnv_obj_step15@tumor_subclusters$subclusters[[grp]])) {
        subcluster_obj@reference_grouped_cell_indices[[grp2]] = infercnv_obj_step15@tumor_subclusters$subclusters[[grp]][[grp2]]
    }
}

subcluster_obj@observation_grouped_cell_indices = infercnv_obj_step15@tumor_subclusters$subclusters$all_observations

subcluster_obj@tumor_subclusters = NULL

plot_cnv(subcluster_obj,
         cluster_by_groups=TRUE,
         output_filename="leiden_test",
         out_dir=out_dir,
         write_expr_matrix=FALSE)

For some detail on the leiden_function and leiden_resolution settings, you can check the igraph documentation for them directly.. The other options that affect the Leiden subclustering are leiden_method which by default runs a PCA first, but can be set not to ("simple"), and the z_score_filter that calculate a z-score for genes based on the reference to ignore genes with too much variation in the references (such as MHC genes) in the subclustering.

One thing I notice is that the plot shows 20 ref groups while the code you wrote has the option set to 10, are those the settings for this run?

Regards, Christophe.

broadinstitute / infercnv

Extraction of subclusters from inferCNV output #469