False positives for PBMC data

SirKuikka commented 1 year ago

Hi,

I'm getting some false positives results with inferCNV when I try to detect CNVs from myeloid cells using lymphocytes cells as reference. I tested this with a simple PBMC dataset from healthy individuals and I get similar results. Below is the inferCNV plot for the PBMC dataset.

This is the code of my analysis.

tumor_cell_types <- c("ASDC","CD14 Mono","CD16 Mono","cDC1","cDC2","pDC")
refefence_cell_types <- c("B intermediate","B memory","B naive",
                          "CD4 Naive","CD4 Proliferating", "CD4 TCM", "CD4 TEM","CD8 Naive",
                          "CD8 TCM","CD8 TEM", "dnT","gdT","MAIT","NK","Treg","Plasmablast")

so_s <- so[,so$predicted.celltype.l2 %in% c(tumor_cell_types,refefence_cell_types)]

counts_matrix <-  so_s[["RNA"]]$counts
cell_bcs <- colnames(counts_matrix)
cell_types <- so_s$predicted.celltype.l2

cell_types_to_include <- names(table(cell_types))[table(cell_types)>1]
cells_to_include <- cell_types %in% cell_types_to_include 

counts_matrix <- counts_matrix[,cells_to_include]
cell_bcs <- cell_bcs[cells_to_include]
cell_types <- cell_types[cells_to_include]

X <- cbind(cell_bcs,cell_types)
colnames(X) <- c("V1","V2")

write.table(X,file = "annotations_downsampled.txt",
            row.names = FALSE,sep = "\t",quote = FALSE)

library(infercnv)

infercnv_obj = CreateInfercnvObject(raw_counts_matrix=counts_matrix ,
                                    annotations_file="annotations_downsampled.txt",
                                    delim="\t",
                                    gene_order_file="Homo_sapiens.GRCh38.93.txt",
                                    ref_group_names=refefence_cell_types)

sample_name <- "pbmc10k"

infercnv_obj = infercnv::run(infercnv_obj,
                             cutoff=0.1, # cutoff=1 works well for Smart-seq2, and cutoff=0.1 works well for 10x Genomics
                             out_dir=sample_name, 
                             cluster_by_groups=TRUE, 
                             denoise=TRUE,
                             HMM=TRUE,
                             no_plot=FALSE)

save(infercnv_obj,file=paste0(sample_name,"_inferCNV.RData"))

GeorgescuC commented 1 year ago

Hi @SirKuikka ,

First thing, looking at the dendrogram on the left side of the figure, this is a case where the subclustering is too fragmented, so the HMM results are not going to be very useful because most clusters have very few cells. This results in a lot of noise being picked up as signal. Lowering the leiden_resolution parameter should help with this.

Besides that, there are indeed regions where the signal is consistent across all cells so the subclustering would not affect that. One potential reason for this is that you have a lot of different cell types in your analysis, and they do not seem to be the same between references and observations based on the names. One of the first steps of infercnv is to check the average expression per gene across all cells and filter those below the cutoff. When you have a mix of many cell types, and in varied proportions, it is easy to keep genes that are highly expressed or discard genes that are not expressed in the most prevalent cell type(s), but have different expression patterns in others. If there are too many such genes in a region, it can show up as signal that is not representative of CNVs but cell type differences. Even assuming filtering is not an issue, cell type differences can still lead to signal appearing. I would try running the same data through multiple smaller runs of infercnv where you focus on specific cell types, and split the same cell type in a set of cells to use as reference and a set of cells to use as observations (similar to a cross validation).

You could also take a look at which genes are present in the regions that have signal across all cells to see if there is anything specific about them. There are cases such as the cluster of MHC genes on chromosome 6 that often display 2 different patterns of expression in normal cells and only one of those in non healthy cells, but this might not show up in your run because you have more precisely defined cell types.

Regards, Christophe.

SirKuikka commented 1 year ago

Hi,

Thank you for the help.

I will test the cross validation as suggested.

But if I use subsets of the same cell types as reference and observations, is inferCNV going to be able to find any true CNVs? My exprectation would be that there would be no findings.

SirKuikka commented 1 year ago

So if I understood correctly, I would run this analysis repeatedly and see if some CNVs show up repeatedly in same cells? These would likely be real? I selected specific cell types (myeloid cells), split the data randomly into two parts, and used one half as references and the other as observations. Your answer mentioned splitting a cell type into reference and observation. So should I instead use only one cell type as reference (one of the random halves).

counts_matrix <- seurat_object@assays$RNA@counts[,seurat_object$orig.ident==sample_name]
  cell_bcs <- colnames(counts_matrix)
  cell_types <- as.character(seurat_object$cluster[seurat_object$orig.ident==sample_name])

  cell_types_to_include <- names(table(cell_types))[table(cell_types)>10]
  cell_types_to_include <- intersect(myeloid_cell_type_names,cell_types_to_include)
  cells_to_include <- cell_types %in% cell_types_to_include 

  counts_matrix <- counts_matrix[,cells_to_include]
  cell_bcs <- cell_bcs[cells_to_include]
  cell_types <- cell_types[cells_to_include]

  i_tumor <- sample(dim(counts_matrix)[2],ceiling(ncol(counts_matrix)/2))
  i_ref <- setdiff(1:dim(counts_matrix)[2],i_tumor)

  X <- cbind(cell_bcs,cell_types)
  colnames(X) <- c("V1","V2")

  X_cancer <- X[i_tumor,]
  counts_matrix_cancer <- counts_matrix[,i_tumor]

  X_reference <- X[i_ref,]
  counts_matrix_reference <- counts_matrix[,i_ref]

  X_reference[,2] <- paste0("reference_",X_reference[,2])

  counts_matrix <- RowMergeSparseMatrices(counts_matrix_cancer,
                                          counts_matrix_reference)

  X <- rbind(X_cancer,X_reference)

  write.table(X,file = "annotations_downsampled.txt",
              row.names = FALSE,sep = "\t",quote = FALSE)

  refefence_cell_types <- unique(X_reference[,2])

  library(infercnv)

  infercnv_obj = CreateInfercnvObject(raw_counts_matrix=counts_matrix,
                                      annotations_file="annotations_downsampled.txt",
                                      delim="\t",
                                      gene_order_file="Homo_sapiens.GRCh38.93.txt",
                                      ref_group_names=refefence_cell_types)

  infercnv_obj = infercnv::run(infercnv_obj,
                               cutoff=0.1, # cutoff=1 works well for Smart-seq2, and cutoff=0.1 works well for 10x Genomics
                               out_dir=sample_name, 
                               cluster_by_groups=TRUE, 
                               denoise=TRUE,
                               HMM=TRUE,
                               no_plot=FALSE)

broadinstitute / infercnv

False positives for PBMC data #549