JEFworks-Lab / HoneyBADGER

HMM-integrated Bayesian approach for detecting CNV and LOH events from single-cell RNA-seq data
http://jef.works/HoneyBADGER/
GNU General Public License v3.0

Usage of HoneyBADGER on 10x data? #14

Closed · kokonech closed this issue 5 years ago

kokonech commented 5 years ago

Hi! I installed HoneyBADGER on macOS with R 3.5.0, went through the tutorial afterwards, and everything worked fine. My main question: is the package suitable for 3'-focused 10X data? Are there any settings that should be adjusted for such data?
I tried to apply the tool to a 10X sample with gene counts CPM-normalized and log2-adjusted (~2500 cells). However, I ran into several issues. The visualisation function hb$plotGexpProfile() got stuck and did not work in RStudio. I then tried writing to a PDF, which worked, but PNG and SVG output did not. The analysis then got stuck on hb$calcGexpCnvBoundaries(init=TRUE, verbose=FALSE): it ran for more than 3 hours and the R session crashed. What could the problem have been? RAM limits? I am going to try rerunning it on a Linux cluster as well, but thought this might be fixable through the settings.

JEFworks commented 5 years ago

Hi Konstantin,

Yes, HoneyBADGER is suitable for 3'-only sequencing data such as 10X data, though your CNV detection resolution will be compromised due to the reduced SNP density. We anticipate that you will only be able to reliably identify CNVs that are chromosome-arm sized or larger. This is discussed in more depth in the original publication.

That being said, we have not extensively tested the memory usage and runtime, though it is strange that plotGexpProfile would cause anything to crash. Can you please check that the function works with the built-in data?

## from the documentation examples
library(HoneyBADGER)
data(gexp)
data(ref)
require(biomaRt) ## for gene coordinates
mart.obj <- useMart(biomart = "ENSEMBL_MART_ENSEMBL", 
    dataset = 'hsapiens_gene_ensembl', 
    host = "jul2015.archive.ensembl.org")
hb <- HoneyBADGER$new()
hb$setGexpMats(gexp, ref, mart.obj, filter=FALSE, scale=FALSE)
hb$plotGexpProfile()

calcGexpCnvBoundaries uses an HMM to identify potential CNVs and then tests each one using a Bayesian hierarchical model. It also does this iteratively to identify potentially smaller CNVs. If many potential CNVs are identified, it can take a while to run. For 10X data, to save runtime, you can set min.traverse=1, since smaller CNVs are likely not going to be identifiable anyway; see the sketch below.
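For example, a minimal sketch (assuming the hb object has already been set up as in the tutorial):

# restrict the iterative HMM traversal to a single pass to save runtime on 10X data
hb$calcGexpCnvBoundaries(init=TRUE, min.traverse=1, verbose=FALSE)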

I'm not sure why it crashed your R session though. Perhaps there's a memory leak somewhere. Can you please try calculating the posterior probabilities for a manually chosen (instead of HMM-identified) CNV using the built-in data (or feel free to substitute in your own data):

## from the documentation examples
library(HoneyBADGER)
data(gexp)
data(ref)
require(biomaRt) ## for gene coordinates
mart.obj <- useMart(biomart = "ENSEMBL_MART_ENSEMBL", 
    dataset = 'hsapiens_gene_ensembl', 
    host = "jul2015.archive.ensembl.org")
gexp.mats <- setGexpMats(gexp, ref, mart.obj, filter=FALSE, scale=FALSE)
mvFit <- setMvFit(gexp.mats$gexp.norm)
results <- calcGexpCnvProb(gexp.mats$gexp.norm, gexp.mats$genes, 
    mvFit, region=GenomicRanges::GRanges('chr10', IRanges::IRanges(0,1e9)), verbose=TRUE)

This will help determine whether it's the HMM that is crashing things or the Bayesian model.

Thanks!

Best, Jean

kokonech commented 5 years ago

Hi Jean,

thanks a lot for the detailed reply!

I tested all your code, and everything went well.

With my dataset, memory was probably the main issue. I decreased the number of cells to ~1000 and reran the analysis using min.traverse=1. It took about 8 hours. The longest steps were calcGexpCnvBoundaries and retestIdentifiedCnvs. Here's the code:

hb <- new('HoneyBADGER', name='GBM61') # input data normalization: log2(CPM / 10 + 1)
hb$setGexpMats(gbm61_expr, ref, mart.obj, filter=FALSE, scale=FALSE, verbose=TRUE)
hb$setMvFit(verbose=TRUE)
hb$setGexpDev(verbose=TRUE)
# set min.traverse to limit runtime
hb$calcGexpCnvBoundaries(init=TRUE, verbose=FALSE, min.traverse=1)
# check preliminary result
bgf <- hb$bound.genes.final
genes <- hb$genes
regions.genes <- range(genes[unlist(bgf)])
# result was a set of 22 regions covering almost whole chromosomes, e.g. chr1:1373730-246931978
print(regions.genes)
hb$retestIdentifiedCnvs(retestBoundGenes=TRUE, retestBoundSnps=FALSE, verbose=FALSE)
results <- hb$summarizeResults(geneBased=TRUE, alleleBased=FALSE)

However the final function summarizeResults did not work and returned:

Error in dimnames(x) <- dn : length of 'dimnames' [1] not equal to array extent

If I understand correctly, no confirmed CNVs were detected. That is surprising, since the bulk comparison suggests there should be CNVs. Did I miss something in the code? Could it be an issue with some option settings? Could the normalization method have an effect?

JEFworks commented 5 years ago

Great!

Yes, those steps may take quite a while if many potential CNVs are identified. Do you have access to multiple cores? There is an n.cores parameter that defaults to 1 but can be increased if multiple cores are available for parallel processing; see the sketch below.
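For example, a rough sketch (assuming calcGexpCnvBoundaries and retestIdentifiedCnvs both accept n.cores; check their help pages to confirm):

# use 4 cores for the HMM-based boundary detection and the retesting step
hb$calcGexpCnvBoundaries(init=TRUE, min.traverse=1, verbose=FALSE, n.cores=4)
hb$retestIdentifiedCnvs(retestBoundGenes=TRUE, retestBoundSnps=FALSE, verbose=FALSE, n.cores=4)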

However the final function summarizeResults did not work

This is actually a bug that was just addressed in commit 4eb9c06.

For a work-around without having to reinstall, please see: https://github.com/JEFworks/HoneyBADGER/issues/13

kokonech commented 5 years ago

Thanks a lot for the detailed reply and the fix! I pulled the latest version and will report back on the results, taking the multi-core option into account. In the meantime, I also went through this manually, using the following code (similar to the new source) to check the results:

rgs <- hb$cnvs[["gene-based"]][["all"]]
retest <- hb$results[["gene-based"]]
# posterior probabilities of amplification / deletion per region per cell
amp.gexp.prob <- do.call(rbind, lapply(retest, function(x) x[[1]]))
del.gexp.prob <- do.call(rbind, lapply(retest, function(x) x[[2]]))
# keep regions supported by more than min.num.cells cells
min.num.cells <- 2
vi1 <- rowSums(amp.gexp.prob > 0.75) > min.num.cells
vi2 <- rowSums(del.gexp.prob > 0.75) > min.num.cells

summary(vi1)
# Mode    TRUE
# logical      22

summary(vi2)
#   Mode   FALSE
# logical      22

Surprisingly, all 22 regions (covering whole chromosomes) came out as TRUE for amplification, while no deletions were detected at all. However, according to bulk data from this sample, there should be several chromosomal deletions.

What could be the issue here? Could the reference dataset be the cause? I used the standard "ref" normal brain GTEx object from the tutorial, since it should be a good fit for a GBM tumor, but its distribution does not seem to match the log2 CPM distribution of our sample.

Also, in your manuscript you mentioned using the expression of the "normal cells" from the 3' target sample (MM135) as the reference. But what did you use as the reference control in the beginning? Did the allele calling work well on this 10X sample with standard options?

JEFworks commented 5 years ago

Surprisingly, all 22 regions (covering whole chromosomes) came out as TRUE for amplification, while no deletions were detected at all. However, according to bulk data from this sample, there should be several chromosomal deletions.

Hmm, did you filter and scale the genes? In the tutorial, the genes were already filtered and scaled, so the filter and scale parameters were set to FALSE.

hb$setGexpMats(gexp, ref, mart.obj, filter=FALSE, scale=FALSE, verbose=TRUE)

For your dataset, the genes will need to be filtered. The filtering thresholds (such as minMeanTest, minMeanRef, minMeanBoth) will likely need some tweaking from the defaults for 10X data. See ?setGexpMats for more detail and the sketch below.
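For example, a rough sketch (the threshold values here are placeholders, not recommendations, and will need tuning for your 10X data; gbm61_expr, ref, and mart.obj are as in your code above):

# enable filtering and scaling; placeholder thresholds to be tuned for 10X data
hb$setGexpMats(gbm61_expr, ref, mart.obj,
    filter=TRUE, scale=TRUE,
    minMeanTest=0.5, minMeanRef=0.5, minMeanBoth=0.5,
    verbose=TRUE)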

Also, in your manuscript you mentioned using the expression of the "normal cells" from the 3' target sample (MM135) as the reference. But what did you use as the reference control in the beginning? Did the allele calling work well on this 10X sample with standard options?

In the manuscript, for MM135, we used the allele-based method to identify putative normal cells and then used those normal cells as the normalization reference for the expression-based method. Allele calling does work with 10X, though your SNP density is substantially reduced, as expected. That being said, we've also used the GTEx normal brain as an expression reference for GBM before, and it seemed to work reasonably well. We did not use a GTEx reference for the MM135 sample because GTEx only had sequencing of mixed PBMCs rather than sorted blood cell types.