broadinstitute / infercnv

Inferring CNV from Single-Cell RNA-Seq
Other
557 stars 164 forks source link

Extraction of subclusters from inferCNV output #469

Open Yurany2908 opened 1 year ago

Yurany2908 commented 1 year ago

Dear all,

I have run inferCNV individually on multiple tumor samples (~1 to 15K cells). As reference, I used a mixture of healthy plasma cells from different samples. I have run each using the following script:

infercnv_obj = CreateInfercnvObject(raw_counts_matrix = seurat_oj[["RNA"]]@counts, annotations_file="annotation_file.tab", delim="\t", gene_order_file="gene_order_file_b.tab", ref_group_names=c("C2", "C3", "C4"))

infercnv_obj = infercnv::run(infercnv_obj, cutoff=0.1, out_dir="output", num_ref_groups=10, denoise=TRUE, HMM=TRUE, analysis_mode = "subclusters")

I am sending the final plot and the 19.HMM prediction plot of a representative case. In this, C1 represents the healthy population in the sample and 57 is the malignant population.

Screen Shot 2022-10-20 at 11 53 15 AM Screen Shot 2022-10-20 at 11 53 41 AM

I would like to evaluate intra-tumor heterogeneity based on the inferred CNVs detected in different subclusters detected in each patient sample. However, I am a little confused about the output that I need to used to define the number/identity of the subclusters and the cell IDs for each one. I found a few post post about this, but the information is not very clear to me. Two ideas came up fromt he reading and undertanding the inferCNV output:

If you could help me to define the best way to identify the number and the identity of the subclusters in each sample, it would be extremely helpful.

Many thanks in advance for your help!

Regards,

Yurany

GeorgescuC commented 1 year ago

Hi @Yurany2908 ,

Which version of infercnv are you running? The subclustering has gone through iterations and some changes to the default settings, so the quality can vary depending on the version you are using. If you look at the file names in the output folder, you should have the subclustering method used in the names of some of the files, which should be either random_trees or leiden. In particular, the newest version available through Github and with the next Bioconductor release (scheduled for next week) has improved Leiden clustering, with more options available for tuning it. One of the difficulties with subclustering is that the optimal settings can vary based on the size of the dataset and its diversity, so some iteration with different settings can be useful (for that, set the up_to_step=15 option so you do not run all the steps past the subclustering until you are satisfied with its quality). To clearly visualize the subclustering done at step 15 without having to continue the run, you can run this snippet:

infercnv_obj_step15 = readRDS("15_tumor_subclustersHMMi6.leiden.infercnv_obj")
subcluster_obj = infercnv_obj_step15

subcluster_obj@reference_grouped_cell_indices = list()
for (grp in names(infercnv_obj_step15@reference_grouped_cell_indices)) {
    for (grp2 in names(infercnv_obj_step15@tumor_subclusters$subclusters[[grp]])) {
        subcluster_obj@reference_grouped_cell_indices[[grp2]] = infercnv_obj_step15@tumor_subclusters$subclusters[[grp]][[grp2]]
    }
}

subcluster_obj@observation_grouped_cell_indices = infercnv_obj_step15@tumor_subclusters$subclusters$all_observations

subcluster_obj@tumor_subclusters = NULL

plot_cnv(subcluster_obj,
         cluster_by_groups=TRUE,
         output_filename="leiden_test",
         out_dir=out_dir,
         write_expr_matrix=FALSE)

For some detail on the leiden_function and leiden_resolution settings, you can check the igraph documentation for them directly.. The other options that affect the Leiden subclustering are leiden_method which by default runs a PCA first, but can be set not to ("simple"), and the z_score_filter that calculate a z-score for genes based on the reference to ignore genes with too much variation in the references (such as MHC genes) in the subclustering.

One thing I notice is that the plot shows 20 ref groups while the code you wrote has the option set to 10, are those the settings for this run?

Regards, Christophe.