JEFworks-Lab / HoneyBADGER

HMM-integrated Bayesian approach for detecting CNV and LOH events from single-cell RNA-seq data
http://jef.works/HoneyBADGER/
GNU General Public License v3.0
95 stars 31 forks source link

10x data normalization problem #24

Closed mariamonberg closed 4 years ago

mariamonberg commented 5 years ago

I work with single-cell sequencing technologies, and have been troubleshooting using HoneyBADGER on some of my pancreatic adenocarcinoma scRNA data (generated using 10x genomics chemistry and an illumina nextseq500 platform) to infer CNVs. I've been struggling though, as I'm unsure what best methods to use for normalizing my data for the program, how you structured your reference data from GTex and how I can mimic that with my own "normal" pancreatic sequencing data, and how I might be able to actually get HoneyBADGER to acknowledge my files instead of defaulting to your pre-built references.

Any help you could offer would be greatly appreciated!

JEFworks commented 5 years ago

Hi Maria,

Thanks for the follow up!

Just to give you a little more background as I'm sure you're aware from reading the paper (https://genome.cshlp.org/content/early/2018/06/13/gr.228080.117), the expression-based model in HoneyBADGER is able to identify large-scale copy-number alterations based on consistent deviations in expression from a normal reference. For example, if your cancer sample has a duplication of a chromosome arm, on average, you should see increased expression for the genes within that chromosome arm compared to the normal tissue. Likewise, for a deletion of a chromosome arm, you should see decreased expression compared to the normal tissue. And HoneyBADGER uses a hidden Markov model to identify these potential regions as well as a Bayesian hierarchical approach to quantify the probability of these alterations within individual cells.

Now what this means practically is that we need single-cell RNA-seq cancer data (in your case pancreatic adenocarcinoma) and a normal RNA-seq analogue (in your case normal pancreas). And we need to normalize these two dataset in as much as the same way as possible such that we can interpret deviations in the cancer expression data as signal from underlying CNV changes rather than simply artifacts from normalization differences. In the manuscript, we used a Counts Per Million (CPM) type normalization. You may also use other normalizations as long as they are consistent.

To have HoneyBADGER acknowledge your own files, you will just need to pass them in:

hb <- new('HoneyBADGER', name='pancAdeno')
hb$setGexpMats(YOUR_SCRNASEQ_CPM_NORMALIZED_DATA, 
                              YOUR_NORMALSEQ_CPM_NORMALIZED_DATA, 
                              mart.obj, filter=TRUE, scale=TRUE, verbose=TRUE)

Here, YOUR_SCRNASEQ_CPM_NORMALIZED_DATA would be your CPM normalized single-cell RNA-seq data with genes as rows, cells as columns. And likewise YOUR_NORMALSEQ_CPM_NORMALIZED_DATA would be your CPM normalized bulk RNA-seq GTEX data with genes as rows, samples as columns.

Note that unlike the tutorial, here, we are using your normalized data matrices AND we are setting the filter and scale parameters to TRUE. The built-in datasets have already been pre-normalized, filtered, and scaled.

There are additional filter parameters that you will likely need to toggle. For example, it is well known that single-cell RNA-seq and bulk RNA-seq data differ systematically such that simply summing up expression from single-cell RNA-seq data does not yield the same expression quantifications as bulk RNA-seq data, especially for lowly expressed genes due to technical limitations. Therefore, if we see a gene that's lowly expressed in your normal reference's bulk RNA-seq and not at all expressed in cancer single-cell RNA-seq, this is likely due to technical limitations rather than because there is a deletion in the cancer single-cell RNA-seq. So there are a few parameters minMeanBoth minMeanTest and minMeanRef that simply attempt to filter out such poorly detected genes. See ?setGexpMats for more information.

In the manuscript, for one 10X multiple myeloma dataset we actually had single-cell RNA-seq data for both the cancer and the normal reference, sequenced on the same platform using the same chemistry, etc. In such a case, this kind of filtering was no longer necessary. So do keep in mind what is the purpose of each step in the tutorial and see if it is still appropriate for your analysis.

Hope that helps. Let me know if you have any additional questions, Jean

mariamonberg commented 5 years ago

Hi Jean!

Thank you SO much for the help here! Really appreciate it!

One upstanding issue- I have normalized 2 data sets (hpne_means is normal pancreas tissue, bnc2 is my sample data), and went to run the hb$setGexpMats command to generate the gene expression profile plot. This is the error I received:

hb$setGexpMats(bnc2, hpne_means, mart.obj, filter=TRUE, scale=TRUE, verbose=TRUE) Initializing expression matrices ... 23 genes passed filtering ... Scaling coverage ... Normalizing gene expression for 23 genes and 400 cells ... Done setting initial expression matrices! hb$plotGexpProfile() Error in seq_len(nrow(d)) : argument must be coercible to non-negative integer In addition: Warning message: In seq_len(nrow(d)) : first element used of 'length.out' argument

I have not gotten this error before, even as I was troubleshooting HoneyBADGER on various data sets over the past couple weeks, and I'm not sure what I can do to solve.

Any help??

Thanks, Maria

JEFworks commented 5 years ago

Hi Maria,

Great! Sounds like you're making progress.

So as I mentioned previously, there are a few threshold parameters minMeanBoth minMeanTest and minMeanRef that attempt filter out such poorly detected genes. You can imagine how these threshold parameters may need to be tweaked depending on your dataset. Perhaps your sequencing depth isn't as deep as the dataset I used to set these default parameters.

It looks like there are only 23 genes that remained after filtering! This is a very very small number of genes so most likely the default filter thresholds are way too stringent for your data.

You should look at the distribution of gene expression in your data. For example:

# histogram of average gene expression
hist(rowMeans(bnc2))
# you may want to plot on a log scale
hist(log10(rowMeans(bnc2)+1) 

As well as how well these expression measurements correlate with your bulk. For example:

# correlation of average gene expression
plot(hpne_means, rowMeans(bnc2))
# you may want to plot on a log scale
plot(log10(hpne_means+1), log10(rowMeans(bnc2)+1))
# add lines for your thresholds
abline(v = minMeanTest, col='red') 
abline(h = minMeanRef, col='red') 
abline(v = minMeanBoth, col='blue')
abline(h = minMeanBoth, col='blue')

My sense is there will be some systematic differences, particularly for lowly expressed genes due to drop-out artifacts in your single-cell data. You can use these plots to choose a more sensible threshold for your data.

Best, Jean