JEFworks-Lab / HoneyBADGER

HMM-integrated Bayesian approach for detecting CNV and LOH events from single-cell RNA-seq data
http://jef.works/HoneyBADGER/
GNU General Public License v3.0
95 stars 31 forks source link

gene filtering different in HoneyBADGER object #39

Open JocelynSP opened 3 years ago

JocelynSP commented 3 years ago

Hi, I have just installed HoneyBADGER, planning to use it for sc 10X analysis. I am seeing radically different filtering of genes between gexp.mats <- setGexpMats(...) and hb$setGexpMats(...) For example with my own data:

> hb <- new('HoneyBADGER', name=paste0( samplegroup, "LN1vTum1"))
> hb$setGexpMats( as.matrix(gexpOI), as.matrix(gexpRef), mart.obj,
+                 filter=TRUE, scale=TRUE )
Initializing expression matrices ... 
111 genes passed filtering ... 

> gexp.mats <- setGexpMats(as.matrix(gexpOI), as.matrix(gexpRef), mart.obj,
+                          filter=TRUE, scale=TRUE)
Initializing expression matrices ... 
15282 genes passed filtering ... 

Or with the example data (which doesn't need filtering, but I used for a comparison test):

> hb <- new('HoneyBADGER', name='MGH31')
> testexample <- setGexpMats(gexp, ref, mart.obj, filter=TRUE, scale=FALSE, verbose=TRUE)
Initializing expression matrices ... 
3463 genes passed filtering ... 
Normalizing gene expression for 3463 genes and 75 cells ... 
Done setting initial expression matrices!                                                   
> hb$setGexpMats(gexp, ref, mart.obj, filter=TRUE, scale=FALSE, verbose=TRUE)
Initializing expression matrices ... 
39 genes passed filtering ... 

Can you tell me what is wrong with the internal version? Thanks!

JEFworks commented 3 years ago

Hi Jocelyn,

Thanks for reaching out.

Astute observation. The general goal of the filtering is to remove lowly expressed genes that are not well detected in the single-cell data for technical reasons so that we can focus on gene with lower expression due to CNV losses for example. Initially, when we developed HoneyBADGER, we used it on SmartSeq2 single cell RNA-seq data and the default filters were set with SmartSeq2 data in mind. When we expanded to 10X data as well, we decided to use less stringent filters by default and updated the default parameters of setGexpMats but forgot to update hb$setGexpMats. You can look into the default parameters for both functions to see the difference.

## for setGexpMats
minMeanBoth=0, minMeanTest=mean(gexp.sc.init[gexp.sc.init!=0]), minMeanRef=mean(gexp.ref.init[gexp.ref.init!=0]),
## for hb$setGexpMats
minMeanBoth=4.5, minMeanTest=6, minMeanRef=8,

This does need to be fixed. If you would be so kind as to help update hb$setGexpMats to have the default parameters as setGexpMats and make a pull request, I'd be happy to accept your pull so you can get credit for your contribution.

The example data has already been filtered and normalized relative to the reference data. Note that after normalization to the reference data, we cannot use the same filtering criteria.

Hope that helps, Jean

CloudCurio commented 1 year ago

Hi, I know that this is an old thread, but I'm getting a similar issue, so I though I'll continue it here instead of creating duplicate threads. I'm also working with 10x data, and have turned off filtering like this:

hb$setGexpMats(cancer_matrix, ref_matrix, mart_obj, 
               filter=FALSE, scale=FALSE, verbose=TRUE)

Still, I'm getting an error like this:

ERROR NO GENES AFFECTED BY CNVS IDENTIFIED! Run calcGexpCnvBoundaries()? Error in rowMeans(amp.gexp.prob) :
  'x' must be an array of at least two dimensions
Calls: <Anonymous> -> cbind -> cbind -> data.frame -> rowMeans

Preceeding the error, I get deletion probability of 0 for every cnv, and amplification probabilities between 0.5 and 0.6. When probabilities per barcode are created, I see the same picture with probabilities between 0.5 and 0.6. I believe that the issue is happening because all significantly differing datapoints are filtered out previously, but can't pinpoint it, since, again, the filtering option of setGetxpMaps method is turned off. Was this issue fixed previously, or is it still open? Thank you in advance.

Best regards, Dmytro

JEFworks commented 1 year ago

Dear Dmytro,

Thanks for reaching out.

Does running setGexpMats provide you with any other message when verbose=TRUE?

Without seeing a snippet or knowing the dimensions of your cancer_matrix and ref_matrix, my best guess is that the gene names in these matrixes do not match up with gene names in the mart.obj annotations.

If calcGexpCnvBoundaries is returning an error that there are too few genes, you may want to check the dimensions of the gexp.norm and genes slot in your hb object to see if it's an issue with there being too few genes at this stage or perhaps just certain chromosomes having no genes. By default, all chromosomes are tested via the parameter chrs=paste0('chr', c(1:22)) though we have seen instances with smaller chromosomes like chromosome 21 having very few genes detected when we do shallow sequencing for example.

Hope this helps, Jean