broadinstitute / infercnv

Inferring CNV from Single-Cell RNA-Seq

parsing output to extract tumor cells #222

Closed igordot closed 2 years ago

igordot commented 4 years ago

I was trying to use inferCNV to separate tumor from non-tumor cells. I have several samples and there are definitely immune (not tumor) cells present. I did not specify any groups as reference. This was my command:

infercnv::run(
  infercnv_obj = infercnv_obj, cutoff = 0.1, min_cells_per_gene = 50,
  out_dir = "./infercnv-out", num_threads = 16,
  analysis_mode = "subclusters",
  denoise = TRUE, HMM = TRUE,
  plot_steps = FALSE, no_prelim_plot = TRUE, png_res = 300
)

I think subclusters mode is the important setting for my use-case. This is my final infercnv.png:

[Image: infercnv.png heatmap]

There are clearly some CNVs present. However, to find different clusters, I think I am supposed to be using step 19 output. This is my infercnv.19_HMM_predHMMi6.rand_trees.hmm_mode-subclusters.Pnorm_0.5.repr_intensities.png:

[Image: infercnv.19_HMM_predHMMi6.rand_trees.hmm_mode-subclusters.Pnorm_0.5.repr_intensities.png]

Visually, I can again see some clear clusters. Based on the right-side dendrogram, there are 8 clusters. However, looking at infercnv.19_HMM_predHMMi6.rand_trees.hmm_mode-subclusters.Pnorm_0.5.repr_intensities.observation_groupings.txt, there are more than that.

I thought the information I am looking for would be in the .dat files. I checked cnv_regions.dat, but the state ranges from 2-5. The documentation says it should be 0-3 in 0.5 steps.

If I am trying to classify tumor cells by the presence of CNVs or just different CNV clusters, what is the best approach and which files should I be using?

GeorgescuC commented 4 years ago

Hi @igordot ,

The values you see in the observation_groupings.txt are the ones displayed in the 2 color bars. The first is the groupings if split in "k_obs_groups" (much more basic, and is independent of the subclusters option), and the second is your observation annotation. For the cnv_regions.dat, we updated the code to be consistent between the HMM step and the Bayesian filtering step and use the [1;6] range with step size of 1 (the 6 states of the HMM), where the neutral state is 3. I will check the wiki to update the information there.
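As a rough illustration of the 1 to 6 state coding described above, the R sketch below flags non-neutral regions in a table shaped like cnv_regions.dat. The column names (cell_group_name, cnv_name, state, chr, start, end) and the mock rows are assumptions for illustration only; check them against the header of your own file.

```r
# Mock table standing in for a cnv_regions.dat file.
# Column names and values are assumed for illustration, not taken from real output.
regions <- data.frame(
  cell_group_name = c("tumor_A.s1", "tumor_A.s1", "tumor_A.s2", "tumor_B.s1"),
  cnv_name = c("chr6-region_1", "chr8-region_2", "chr6-region_3", "chr8-region_4"),
  state = c(2, 5, 3, 4),
  chr = c("chr6", "chr8", "chr6", "chr8"),
  start = c(100000, 500000, 100000, 500000),
  end = c(900000, 2500000, 900000, 2500000),
  stringsAsFactors = FALSE
)

# State 3 is the neutral state in the 1 to 6 coding; lower is loss, higher is gain.
neutral_state <- 3
classify_state <- function(state, neutral = neutral_state) {
  ifelse(state < neutral, "loss", ifelse(state > neutral, "gain", "neutral"))
}

regions$call <- classify_state(regions$state)

# Keep only regions that deviate from neutral and list the groups that carry them.
altered <- regions[regions$call != "neutral", ]
altered_groups <- unique(altered$cell_group_name)
print(altered[, c("cell_group_name", "chr", "state", "call")])
```

With real output, the mock data frame would be replaced by something like read.table(path, header = TRUE, sep = "\t"), assuming the file is tab-delimited with a header row.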

To retrieve the subcluster information, you can either parse the observation_dendrogram.txt output (Newick format) and split at every non-zero branch, or retrieve the list of cells from the R object directly by reading the infercnv_obj@tumor_subclusters$subclusters slot.
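A minimal sketch of the second route, turning the subclusters slot into a flat cell-to-subcluster table. It assumes the slot is a nested list (annotation group, then subcluster, then a vector whose names are the cell names); the mock list and the helper name flatten_subclusters are hypothetical, so inspect str() of your own object before relying on this layout.

```r
# Mock of the nested list assumed to live in infercnv_obj@tumor_subclusters$subclusters:
# annotation group -> subcluster -> named integer vector (names are cell names).
subclusters <- list(
  patient1 = list(
    patient1.s1 = c(cell_A = 1L, cell_B = 2L),
    patient1.s2 = c(cell_C = 3L)
  ),
  immune = list(
    immune.s1 = c(cell_D = 4L, cell_E = 5L)
  )
)

# Hypothetical helper: one row per cell with its annotation group and subcluster.
flatten_subclusters <- function(subclusters) {
  rows <- list()
  for (group in names(subclusters)) {
    for (sub in names(subclusters[[group]])) {
      cells <- names(subclusters[[group]][[sub]])
      rows[[length(rows) + 1]] <- data.frame(
        cell = cells, group = group, subcluster = sub,
        stringsAsFactors = FALSE
      )
    }
  }
  do.call(rbind, rows)
}

cell_table <- flatten_subclusters(subclusters)
print(cell_table)
```

On a real run, the mock list would be replaced by infercnv_obj@tumor_subclusters$subclusters, and the resulting table can be joined to cell metadata by cell name.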

As for classifying your cells as tumor or normal, keep in mind that when no set of cells is provided as reference, infercnv takes the average of all cells as the reference. Depending on the ratio of normal to mutant cells, that average will look more like one or the other. I also notice that your cells cluster very well with their annotation; are those defined by experimental runs, cell type, or something else?
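A toy calculation may make the reference-free caveat concrete. The numbers below are invented, and the subtraction is a deliberate simplification of what infercnv does (the real pipeline normalizes, smooths, and works on transformed expression); it only shows how the all-cell average shifts with the normal-to-tumor ratio.

```r
# Hypothetical relative expression for one genomic region:
# normal cells sit at 1.0, tumor cells carrying a gain sit at 1.5.
normal_level <- 1.0
tumor_level <- 1.5

# Signal each cell type appears to have when the average of all cells is the baseline.
apparent_signal <- function(n_normal, n_tumor) {
  values <- c(rep(normal_level, n_normal), rep(tumor_level, n_tumor))
  baseline <- mean(values)  # average of all cells acts as the reference
  c(normal = normal_level - baseline, tumor = tumor_level - baseline)
}

mostly_normal <- apparent_signal(n_normal = 80, n_tumor = 20)
mostly_tumor <- apparent_signal(n_normal = 20, n_tumor = 80)

print(round(mostly_normal, 2))
print(round(mostly_tumor, 2))
```

When tumor cells dominate, the baseline moves toward the tumor level, so the normal cells look like they carry a loss while the tumor gain nearly disappears.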

Regards, Christophe.

igordot commented 4 years ago

Thanks for the clarification!

I checked infercnv_obj@tumor_subclusters$subclusters and indeed I found the clusters with cell names in there. Should I be using the final object or step 19? Mine are identical, but maybe that is not always the case. In the plots I posted, the cell order looks different.

I probably should've included this question in the original post, but is it better to use 3-state HMM instead of 6-state? If my primary goal is to identify any aberrations, rather than exact copy numbers, does the extra granularity help? It also seems like it would perform better without specifying reference cells.

To answer your cluster question, the clusters are based on unbiased clustering. Some of the tumors form distinct clusters, so it's not surprising to see copy number data cluster them as well.

GeorgescuC commented 4 years ago

Hi @igordot ,

The cells within the 8 clusters should be identical between the 2 steps. The reason why they are not displayed in the same order in the HMM plot and the final plot is due to the fact that in the HMM step, the dendrogram is recalculated for the display only (all cells in a given subcluster are identical at this point, so there's no way for subclusters to get mixed).

HMM i3 may perform better in telling you where there are differences when not specifying reference cells. However, for both models, keep in mind that since there is no reference, the cells showing signal could in theory be the normal ones, while the ones showing no signal could be the mutant ones if the CNV is shared across the majority of cells. In your case, the main region to figure out seems to be chr6, where loss and gain signals seem to cancel out because they are present in a similar proportion of the cells. It is possible that the gain signal is the base level, the no-signal state is a single loss, and the loss signal is a double loss. You should be able to verify this by checking whether any raw read counts are 0 across a region, which would indicate a double loss.
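One way to sketch the zero-count check suggested above: for each cluster, compute the fraction of cells with no raw reads at all across the genes in the region. The count matrix, gene names, and cluster assignments below are mock values, and zero_fraction is a hypothetical helper. Since dropout makes zeros common in single-cell data, a high fraction is a hint rather than proof of a double loss.

```r
# Mock raw count matrix (genes x cells); a real analysis would use the raw counts given to infercnv.
counts <- matrix(
  c(0, 0, 3, 5,
    0, 0, 2, 4,
    0, 1, 6, 2,
    7, 8, 9, 6),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                  c("cell_1", "cell_2", "cell_3", "cell_4"))
)

# Hypothetical genes inside the region of interest and hypothetical cluster membership.
region_genes <- c("geneA", "geneB", "geneC")
cluster_cells <- list(loss_cluster = c("cell_1", "cell_2"),
                      other_cluster = c("cell_3", "cell_4"))

# Fraction of cells in a cluster whose summed raw counts over the region are exactly 0.
zero_fraction <- function(counts, genes, cells) {
  region_total <- colSums(counts[genes, cells, drop = FALSE])
  mean(region_total == 0)
}

zero_by_cluster <- sapply(cluster_cells, function(cells) {
  zero_fraction(counts, region_genes, cells)
})
print(zero_by_cluster)
```

Comparing this fraction between the cluster that shows the loss signal and the others, ideally over a region with many genes, gives a rough sense of whether the region is truly absent.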

Thanks for the answer!

Regards, Christophe.

igordot commented 4 years ago

Thank you for the help!

Since I initially posted the issue, I repeated the analysis. I marked immune cells, which are fairly easy to separate and should be normal, as reference cells. Based on the heatmap, some regions improved. For example, chr8 has a very obvious amplification now. For some reason, chr6 still shows aberrations even in the reference. I am posting the output below.

[Image: infercnv.png heatmap, rerun with immune cells as reference]

Based on the HMM output, the results are much better. Unlike last time, the clusters are labeled by patient. Each one produces a distinct pattern, as would be expected. They split into normal and altered subsets. Some only have a normal profile, though.

[Image: infercnv.19_HMM_predHMMi6.rand_trees.hmm_mode-subclusters.Pnorm_0.5.repr_intensities.png, rerun with immune cells as reference]

GeorgescuC commented 4 years ago

Hi @igordot ,

Those results look much better indeed! If you can separate your immune cells by more specific types, it is possible the patterns would cluster the same way, and thus you could remove them properly. Alternatively, you could give a try to the num_ref_groups=3 or num_ref_groups=4 option for infercnv to cluster your references into the given number of subgroups and use those new groups for references.
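A sketch of how the num_ref_groups suggestion could be wired into the run call from the original post. The wrapper name rerun_with_ref_subgroups and the output directory are hypothetical, the other arguments are copied from the command at the top of the thread, and the object is assumed to have been created with the immune cells already marked as reference.

```r
# Hypothetical wrapper: rerun infercnv with the reference cells split into subgroups.
# Assumes infercnv_obj was built with reference groups defined.
rerun_with_ref_subgroups <- function(infercnv_obj, n_ref_groups = 3,
                                     out_dir = "./infercnv-out-refgroups") {
  infercnv::run(
    infercnv_obj = infercnv_obj, cutoff = 0.1, min_cells_per_gene = 50,
    out_dir = out_dir, num_threads = 16,
    analysis_mode = "subclusters",
    denoise = TRUE, HMM = TRUE,
    num_ref_groups = n_ref_groups  # cluster the references into this many subgroups
  )
}
```

Calling it once with n_ref_groups = 3 and once with 4, each with its own out_dir, would allow a side-by-side comparison of whether the chr6 signal in the reference cells goes away.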

Regards, Christophe.