Some issues about reference cells and operation

Dear developer,

Thank you for developing this powerful tool. I am new to infercnv and have several questions on which I would appreciate your advice and guidance.

In my research, I primarily use 10x scRNA-seq data. My goal is to use infercnv to infer the copy number profile of each cell and then conduct further analysis. To that end, I ran the following code and obtained the results below:

infercnv_obj = infercnv::run(infercnv_obj = infercnv_obj,
                             out_dir = out_dir,
                             num_threads = 4,
                             cutoff = 0.1,
                             denoise = TRUE,
                             analysis_mode = "subclusters",
                             cluster_by_groups = FALSE,
                             k_obs_groups = 4,
                             cluster_references = FALSE,
                             plot_steps = FALSE,
                             HMM = TRUE)

However, each individual cell ended up forming its own cluster, and copy number variations also appear in the reference cells (which were annotated as normal cells in a public dataset). The same happens among the observation cells: cells that the Seurat pipeline and SingleR annotate as normal (immune cells, stellate cells) also show copy number variations.

[Figure: infercnv 17_HMM_predHMMi6 leiden hmm_mode-subclusters_1]

After going through related issues and your responses, I attempted the analysis again, adjusting the leiden_resolution, tumor_subcluster_pval, and some data output settings. Since I aim to obtain the copy number profile for each cell, here is the code for my second attempt and the resulting outcomes:

infercnv_obj = infercnv::run(
                    infercnv_obj,
                    out_dir = out_dir,
                    plot_steps=FALSE,
                    num_threads = 5,

                    # filtering genes
                    cutoff=0.1,

                    # About HClust
                    cluster_by_groups =FALSE, 
                    cluster_references = FALSE,
                    k_obs_groups = 1,

                    # Grouping level for image filtering or HMM predictions
                    analysis_mode = "subclusters",
                    tumor_subcluster_partition_method = "leiden",
                    leiden_resolution = 0.0001,  #Default: 0.05
                    tumor_subcluster_pval = 0.05, #Default: 0.1
                    #inspect_subclusters = TRUE,

                    # Downstream Analyses (HMM or non-DE-masking) based on tumor subclusters
                    HMM = TRUE,
                    HMM_report_by = "cell",

                    # De-noising
                    denoise = TRUE
)

However, copy number variation still appears in both the reference cells and the normal immune cells.

[Figure: infercnv 17_HMM_predHMMi6 leiden hmm_mode-subclusters]

Furthermore, the program has been running for more than a week without finishing, which makes it impractical to try multiple parameter sets. I would therefore like to ask for your advice on the following:

1. How can I adjust the parameters to avoid the issue of copy number variations in reference cells and expected normal cells?
2. Based on the second attempt, am I correctly obtaining the copy number profile for each cell after clustering, or are there any parameters that I have missed?
3. Lastly, if I intend to modify a parameter like leiden_resolution, how should I proceed? For example, should I reload the previously outputted data from infercnv, or should I clear certain data to ensure smooth execution? Or does it depend on which parameters I want to change? I came across discussions regarding fine-tuning plot adjustments or changing reference cells, but I'm still not clear about modifying the parameters.

I would greatly appreciate your assistance in answering my questions and providing professional advice.

Thank you.

Hi @39652its,

1. Could you also share the residual expression figure? The HMM results are based on the residual expression, so it is important to compare the two to check whether anything went wrong. Looking at the signal in your references, there may be two populations of cells that should be separated, which can be done with the "num_ref_groups" option. I would also leave cluster_references at its default (TRUE) so that the references are ordered by the clustering.
2. With the second run and set of options, the results do look much better, as the subclustering is no longer oversplit. To make sure the profiles are good, though, I would check as in 1. by comparing against the residual expression and trying to identify why some signal remains in the references.
3. When you modify an option such as leiden_resolution, infercnv should on its own resume the analysis from the last step that the changed option does not affect. The backup objects generated after each step contain the options that were used to generate them, so when infercnv tries to reload one, your current run's options are compared against it. If any option that affects the results up to that step is found to differ, infercnv ignores that backup and checks the one from the previous step, until the "most recent common ancestor" is found.
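
As a rough illustration of the suggestions above, a rerun could look like the sketch below. It assumes that `infercnv_obj` and `out_dir` are the same object and output directory as in the second attempt, so that existing backups can be found. The value `num_ref_groups = 2` is only a guess based on the two populations that seem visible in the references and should be adjusted to the data. All other options are copied from the second attempt.

```r
library(infercnv)

# Assumption: `infercnv_obj` and `out_dir` are the object and output directory
# used for the second attempt, so backups from that run are still in place.
infercnv_obj <- infercnv::run(
    infercnv_obj,
    out_dir = out_dir,
    plot_steps = FALSE,
    num_threads = 5,
    cutoff = 0.1,

    # Reference handling, as suggested in point 1
    cluster_by_groups = FALSE,
    cluster_references = TRUE,  # the default: order references by clustering
    num_ref_groups = 2,         # assumed value: two reference populations

    # Subclustering options, unchanged from the second attempt
    analysis_mode = "subclusters",
    tumor_subcluster_partition_method = "leiden",
    leiden_resolution = 0.0001,
    tumor_subcluster_pval = 0.05,

    HMM = TRUE,
    HMM_report_by = "cell",
    denoise = TRUE
)
```

Once the run finishes, the residual expression heatmap (typically written as infercnv.png in `out_dir`) can be placed next to the HMM prediction figure to see whether the calls in the references match visible signal.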

Regards, Christophe.

broadinstitute / infercnv

Some issues about reference cells and operation #538