cozygene / bisque

An R toolkit for estimation of cell composition from bulk expression data
68 stars 20 forks source link

Bulk deconvolution using public scRNA seq data? #4

Closed Biomiha closed 5 years ago

Biomiha commented 5 years ago

Hi guys,

Kudos for developing this package. I'm really interested to test it on our data. I've had a couple of gos and I'm experiencing some issues though. I'm looking to determine the cell fractions from PBMCs by supplying a publicly a available single cell RNA seq dataset as reference (e.g. the pbmc_3k dataset from the Seurat tutorial), where we have clustered the cells and determined their putative cell types. What I've noticed is that I get nearly identical bulk.props for all the samples, which doesn't seem right. I've tried it on one of our datasets as well as a publicly available one (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE115259) and I get nearly exactly the same bulk.prop values across the board.

library(tidyverse)
library(Seurat)
library(Biobase)
library(SingleCellExperiment)
library(BisqueRNA)
matrix_to_expressionSet <- function(mat, ...){
  if(!is.matrix(mat)) warning(deparse(substitute(mat)), " is not a matrix") else {
    featureData <- rownames(mat)
    featureData <- as(as.data.frame(featureData), "AnnotatedDataFrame")
    rownames(featureData) <- rownames(mat)
    phenoData <- colnames(mat)
    phenoData <- as(as.data.frame(phenoData), "AnnotatedDataFrame")
    rownames(phenoData) <- colnames(mat)
    ExpressionSet(assayData = mat, phenoData = phenoData, featureData = featureData) }}
# quick run-through of pbmc scRNA seq dataset
pbmc.data <- Read10X(data.dir = "pbmc3k/filtered_gene_bc_matrices/hg19/")
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", 
                     "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
# Export `Seurat` object to `ExpressionSet`
pbmc_sce <- Seurat::as.SingleCellExperiment(pbmc)
counts(pbmc_sce) <- as.matrix(counts(pbmc_sce))
logcounts(pbmc_sce) <- as.matrix(logcounts(pbmc_sce))
single.cell.expression.set <- as(object = as(object = pbmc_sce, Class = "SummarizedExperiment"), Class = "ExpressionSet")
rownames(single.cell.expression.set) <- pbmc@assays$RNA@data %>% rownames
single.cell.expression.set$cellType <- single.cell.expression.set$ident
single.cell.expression.set$SubjectName <- "Boris" #since we're clowning around :)
# Get bulk data: 
GSE115259_raw <- list.files("~/files/GSE115259/", full.names = TRUE) %>% 
  map(.f = ~read_tsv(.x)) %>% 
  set_names(list.files("~/files/GSE115259/") %>% str_remove(pattern = "_gtf_annotated_genes.results.txt.gz"))
GSE115259_eset <- map(.x = seq_along(GSE115259_raw), .f = ~GSE115259_raw[[.x]] %>% dplyr::mutate(!!quo_name(names(GSE115259_raw[.x])) := GSE115259_raw[[.x]]$expected_count) %>% dplyr::select(annotation.gene_id, starts_with("GSM"))) %>% 
  purrr::reduce(full_join) %>% 
  column_to_rownames(var = "annotation.gene_id") %>% 
  as.matrix %>% 
  matrix_to_expressionSet
# Change ensembl id rownames to symbols
tmp_tbl <- rownames(GSE115259_eset) %>% 
  enframe(name = NULL, value = "ensgene") %>%
  dplyr::mutate(in_anno = 1:n()) %>% 
  left_join(annotables::grch38) %>% 
  dplyr::select(ensgene, symbol, in_anno) %>%
  dplyr::mutate(symbol = case_when(is.na(symbol) ~ ensgene,
                                   TRUE ~ symbol)) %>%
  dplyr::distinct(in_anno, .keep_all = TRUE) %>% 
  dplyr::mutate(is_unque = isUnique(symbol))
GSE115259_eset <- tmp_tbl %>% 
  dplyr::filter(is_unque) %>% 
  pull(ensgene) %>% 
  GSE115259_eset[.]
rownames(GSE115259_eset) <- tmp_tbl %>% 
  dplyr::filter(is_unque) %>%
  pull(symbol)
# Run ReferenceBasedDecomposition
GSE115259_res <- ReferenceBasedDecomposition(bulk.eset = GSE115259_eset, sc.eset = single.cell.expression.set, markers = pbmc@assays$RNA@var.features, use.overlap = FALSE, verbose = TRUE)
GSE115259_res$bulk.props

I'm guessing that this could be because the scRNA seq dataset is not from any of the bulk RNA seq samples, however I imagine this would be the major use case for running deconvolution as long as the reference samples are from the same cell types?

Thanks, Miha

brandonjew commented 5 years ago

Hi Miha,

Thanks for using the package! The issue with the pbmc_3k dataset is that it contains cells from only one individual. Currently, the reference-based decomposition requires multiple individuals in the single-cell data in order to learn the decomposition transformation.

I will update the package to clarify the requirement for multiple samples for this function. Thanks again and let me know if you have any other questions!

Brandon

Biomiha commented 5 years ago

Thanks for the clarification Brandon. How many in your opinion would be needed for a reasonable estimate, obviously bearing in mind these would be different individuals for the scRNAseq data and different for the bulk?

brandonjew commented 5 years ago

The smallest number we used was 6 and we observed reasonable estimates without using the overlap option. The method utilizes the observed variance in cell proportions from the single-cell data, so that should be kept in mind when considering single-cell dataset size.

Biomiha commented 5 years ago

OK, thanks. It seems I can get at least 9 (https://bioconductor.org/packages/release/data/experiment/vignettes/TENxPBMCData/inst/doc/TENxPBMCData.html). I'll have a go and let you know how I get on. BW, Miha

brandonjew commented 5 years ago

There may be an issue with these datasets being sequenced with different kits, since batch effects from the different kits (not sure how significant they are) could cause issues with the generated reference profile. The donor_a, donor_b, donor_c, 4k, and 8k samples seem to be generated with the v2 chemistry, but processed with different cell ranger versions.

Biomiha commented 5 years ago

Hi Brandon,

I used the MNN batch correction from the Marioni lab and it's done a reasonably good job of correcting the batch effects. There was one donor that looked out of place, which I've removed (for better or for worse). Now I get good deconvolution with bisque. One thing I noticed though was that all cell populations needed to be present in all of the donors, which could present issues with either rare cell populations or populations that are affected by treatment or time or other effects. Something to consider maybe.

Thanks, Miha

brandonjew commented 5 years ago

Hi Miha,

Glad to hear and thanks for the insight. We'll put some more thought into rare cell populations only detected in a subset of the individuals used for the reference. In the next update we'll allow the reference profile to be built in these cases and notify the user.

Thanks again! Brandon

AleixAS-mdc commented 4 years ago

Dear Miha,

Could you share with me the Seurat object that you eventually used to deconvolut with bisque and determine the cell fractions from your PBMCs? I am having many issues trying to generate a reference dataset (the Seurat object) that allows bisque to properly work and deconvolut my RNAseq data from PBMCs.

Thanks,

Aleix

brandonjew commented 4 years ago

Hi Aleix,

Thanks for trying out the package. What issues are you running into? I might be able to help with them.

Thanks, Brandon

AleixAS-mdc commented 4 years ago

Hi Brandon,

I was trying to perform the refarence-based decomposition using the available single cell RNA seq dataset available from the the Seurat tutorial (the pbmc_3k dataset).

sc.eset <- BisqueRNA::SeuratToExpressionSet(pbmc3k, delimiter="-", position=2, version="v3")

I converted my expression data (data.frame) into a matrix and then into an ExpressionSet format by:

counts2<-as.matrix(counts) bulk.eset <- Biobase::ExpressionSet(assayData = counts2)

( "bulk.eset" returns: ) ExpressionSet (storageMode: lockedEnvironment) assayData: 52422 features, 465 samples element names: exprs protocolData: none phenoData: none featureData: none experimentData: use 'experimentData(object)' Annotation:

However, when I run:

res <- BisqueRNA::ReferenceBasedDecomposition(bulk.eset, sc.eset, markers=NULL, use.overlap = FALSE)

I get the following error:

Error in GetOverlappingGenes(sc.eset, bulk.eset, markers, verbose) : No overlapping genes found between bulk and single-cell expression.

Which I do not understand because "use.overlap" is set to FALSE. Do you know why I get this error? Then from your discussion with Miha I saw you said (on June) that "Currently, the reference-based decomposition requires multiple individuals in the single-cell data in order to learn the decomposition transformation." which I guess is still true.

I am currently trying to obtain a good Seurat object to use as reference from multiple individuals, but I haven't been successful yet. For this reason I would like to know if Miha (or somobody else) could make available the Seurat object he (or others) used to eventually get good deconvolution with bisque for PBMCs.

Thanks for your help,

Aleix

brandonjew commented 4 years ago

Hi Aleix,

The "use.overlap" parameter is for indicating overlapping individuals between the single-cell and bulk experiments. In this case, the issue is due to no overlapping genes between the datasets (perhaps due to naming differences). You can compare the gene names in each dataset with Biobase::featureNames(eset)

In terms of number of individuals, we have found in our experiments that less than 3 samples used as a reference may be inaccurate to use with our model.

Let me know if you need anything else

Thanks, Brandon

gruben93 commented 3 years ago

Hi,

This is my first time analyzing RNAseq data and I was wondering if anyone could help me. I have bulk RNA seq data that I received after sorting a specifically tagged cell population. I would like to determine what is the cell composition of that sample to help understand what type of cell it is through RNA sequencing. How can I use Bisque to help determine the cell composition of my samples? Also, how should the input data look like?

Thank you for your time!

brandonjew commented 3 years ago

Hi @gruben93, thanks for your interest in our method and welcome to the RNAseq world! Given that your RNAseq was done on a sorted cell population, Bisque would not be useful here (it's designed for quantifying cell types within unsorted bulk tissue samples). You should be able to tell what cell type your sample is composed of based off of the tag used to do the sorting.

If you're concerned about the sorting, you may be able to confirm the cell type by checking if the expected cell type markers (https://panglaodb.se/markers.html?cell_type=%27choose%27) are expressed. One issue with this approach is that it is not clear if the expression of these genes is significantly upregulated if there are no samples from other cell types to compare to. Let me know if you have any other questions