Using average profile (across all cells) as a reference if reference normal cells are not provided

RegnerM2015 commented 2 years ago

Hi InferCNV team,

From the markdown dated 2022-02-04: https://bioconductor.org/packages/devel/bioc/vignettes/infercnv/inst/doc/inferCNV.html

"This is done by exploring expression intensity of genes across positions of the genome in comparison to the average or a set of reference ‘normal’ cells. A heatmap is generated illustrating the relative expression intensities across each chromosome, and it becomes readily apparent as to which regions of the genome are over-abundant or less-abundant as compared to normal cells (or the average, if reference normal cells are not provided)."

Can you elaborate on the scenario in the final parenthetical (using the average when reference normal cells are not provided)? Is it less preferable than providing a set of reference cells?

I am working with tumor samples that are mainly carcinomas, so I typically include the epithelial cells as observations and the T cells as the background reference. There are caveats to this (inferCNV will find a lot of cell type-specific differences between epithelial and T cells), but would this approach still be preferable to using the average profile across all cells as the reference background?

Thanks in advance!

GeorgescuC commented 2 years ago

Hi @RegnerM2015 ,

When a set of cells is provided as references, we can use their average expression to define the baseline expression level for the normal two copies present in healthy cells, then compare the malignant cells to that baseline and observe fold changes due to gain or loss of copies. When no reference is provided, we can only look at all cells (healthy and malignant) together, so it is not possible to define a baseline that is certain to represent the two-copy expression level. Depending on which population of cells is most prevalent, and on the CNVs in the malignant cells, the average expression across all cells may be closer to that of a CNV, in which case the normal cells will appear to have the opposite type of CNV.

Take the small example dataset provided with infercnv, where all 4 malignant samples have a deletion in chr1 and chr19. When you run this example with no references, the healthy cells appear to have a duplication at those same positions because there are fewer of them, so the average used as the baseline is skewed towards the deletion expression level. If we removed the healthy cells entirely and only looked at the malignant ones, there would be no CNV signal at all in those chromosomes, as all cells share the same deletions.
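The baseline effect described above can be reproduced with a toy calculation. This is only a sketch with made-up numbers for a single gene in a deleted region; it assumes expression scales with copy number and uses a plain log2 ratio against the baseline mean, which is far simpler than what infercnv actually computes. The function name `relative_expression` is hypothetical.

```python
from math import log2
from statistics import mean


def relative_expression(cells, baseline_cells):
    """Return log2(expression / mean baseline expression) for each cell."""
    baseline = mean(baseline_cells)
    return [log2(value / baseline) for value in cells]


# Hypothetical expression of one gene located in a deleted region.
# Healthy cells carry two copies; malignant cells carry one copy,
# so they are assumed to show roughly half the expression.
normal = [10.0] * 20      # 20 healthy cells (the minority)
malignant = [5.0] * 80    # 80 malignant cells with a one-copy loss

# Case 1: healthy cells are supplied as the reference.
with_ref_malignant = relative_expression(malignant, normal)   # clear loss
with_ref_normal = relative_expression(normal, normal)         # flat

# Case 2: no reference, so the average over all cells is the baseline.
all_cells = normal + malignant
no_ref_malignant = relative_expression(malignant, all_cells)  # weak loss
no_ref_normal = relative_expression(normal, all_cells)        # apparent gain

# Case 3: malignant cells only, no reference.
only_malignant = relative_expression(malignant, malignant)    # no signal

print(with_ref_malignant[0], with_ref_normal[0])
print(no_ref_malignant[0], no_ref_normal[0])
print(only_malignant[0])
```

With the healthy cells as reference, the malignant cells sit at a log2 ratio of -1 and the healthy cells at 0. With the all-cell average as baseline, the malignant cells show only a weak loss while the healthy cells show a positive ratio, which is the apparent duplication. With malignant cells alone, every ratio is 0 and the shared deletion is invisible.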

How prevalent are the cell type-specific differences you see? There is one region in particular, located on chromosome 6, that often shows signal in immune cells (even healthy ones).

Regards, Christophe.

RegnerM2015 commented 2 years ago

Thanks Christophe!

I think I have a better understanding of why the average expression across all cells may not be the best baseline.

As for cell type-specific differences, I frequently find amplifications on chr1 and chr12, which I think can be attributed to cell type differences (epithelial v. T cell) rather than CNV states. See the examples below:

This is a normal tissue sample with epithelial cells as observations and T cells as reference background. This should theoretically serve as a negative control. However, we observe some amplification signal on chr1 and chr12:

[Figure: NormalSampleNegativeControl]

This is a tumor sample with epithelial cells as observations and T cells as reference background. As you can see, we observe the same amplification signal on chr1 and chr12.

To avoid these cell type specific differences, do you think it would be advisable or reasonable to use the epithelial cells from a tumor sample as observations and epithelial cells from normal tissue samples as the reference background? The concern would no longer be cell type-specific differences, but batch effects between samples (observations v. reference cells). I think the epithelial cells from 4 normal tissue samples would be enough to overcome these technical concerns though. What are your thoughts?

GeorgescuC commented 2 years ago

Hi @RegnerM2015 ,

Since you have epithelial cells from normal tissue, I would use those, yes. To test for the impact of technical differences, you could try running your normal tissue samples with 1 as the reference and the other 3 as observations, or with all 4 as observations and no references.
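The negative-control designs suggested above could be enumerated along these lines. This is a sketch only: the sample names are hypothetical, and rotating which normal sample is held out as the reference is an extra assumption rather than something proposed in this thread.

```python
def negative_control_splits(normal_samples):
    """Yield (reference_samples, observation_samples) pairs.

    First, each normal sample in turn is used as the only reference
    while the remaining samples are observations. Last, all samples
    are observations and the reference list is empty.
    """
    for held_out in normal_samples:
        observations = [s for s in normal_samples if s != held_out]
        yield [held_out], observations
    # Final design: no reference at all, every sample is an observation.
    yield [], list(normal_samples)


# Hypothetical sample names for the 4 normal tissue samples.
samples = ["normal_1", "normal_2", "normal_3", "normal_4"]
splits = list(negative_control_splits(samples))

for reference, observations in splits:
    print("reference:", reference, "observations:", observations)
```

Each pair describes one run: if the normal samples are technically comparable, none of these runs should show strong CNV-like signal.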

The signal seen in your 2nd figure among the references in chr6 is the one I was referring to in my previous message that we very often see in immune cells.

Regards, Christophe.

RegnerM2015 commented 2 years ago

To test for the impact of technical differences, you mean running an inferCNV run with epithelial cells from 3 normal samples as observations and epithelial cells from the remaining normal sample as reference?

As I understand it, this negative control would reveal the technical differences between normal samples, which could then be extrapolated to the tumor v. normal comparisons?

In practice, I would run inferCNV for each tumor sample. For each tumor sample, I would use its epithelial cells as observations and the epithelial cells from all 4 normals as the reference background. Does this sound reasonable, having a multi-sample reference background?
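A per-tumor annotation table for this multi-sample reference setup might be assembled as sketched below. The barcodes, labels and helper names are hypothetical. The sketch assumes the usual infercnv input of a tab-delimited annotations file with one cell name and one group label per line, with the reference groups then named through `ref_group_names`. Giving each normal sample its own reference label is a design choice here, not a requirement stated in this thread.

```python
def build_annotations(tumor_cells, normal_cells_by_sample,
                      tumor_label="tumor_epithelial"):
    """Return (rows, ref_group_names) for a single tumor sample.

    rows: list of (cell_name, group_label) tuples, tumor cells first.
    ref_group_names: group labels to be treated as references.
    """
    rows = [(cell, tumor_label) for cell in tumor_cells]
    ref_group_names = []
    for sample, cells in normal_cells_by_sample.items():
        # One reference label per normal sample (hypothetical naming).
        label = f"normal_epithelial_{sample}"
        ref_group_names.append(label)
        rows.extend((cell, label) for cell in cells)
    return rows, ref_group_names


def to_tsv(rows):
    """Render rows as tab-delimited text with no header line."""
    return "".join(f"{cell}\t{group}\n" for cell, group in rows)


# Hypothetical cell barcodes for one tumor and two normal samples.
tumor_cells = ["T1_AAAC", "T1_AAAG"]
normals = {"N1": ["N1_CCCA"], "N2": ["N2_GGGT"]}

rows, ref_groups = build_annotations(tumor_cells, normals)
annotations_tsv = to_tsv(rows)
print(annotations_tsv)
print(ref_groups)
```

Separate labels per normal sample make it easier to spot a single normal sample that behaves differently from the others in the reference block of the heatmap.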

Thanks for the help!

GeorgescuC commented 2 years ago

Hi @RegnerM2015 ,

Yes to all questions, as long as the negative control is good.

Regards, Christophe.

broadinstitute/infercnv, issue #395