Proposed Analysis: Batch effects in RNA-seq data

natemella commented 3 years ago

We would like to propose a continuation of the issue 448.

What are the scientific goals of the analysis?

Batch effects are commonly observed in RNA expression data. Such effects are likely less pronounced in RNA-Seq data than microarray data. However, batch effects may bias conclusions of studies that do not account for these effects. We wish to evaluate the level to which batch effects are present in the RNA-Seq data from this study and create alternative versions of the data that have been adjusted for batch effects.

What methods do you plan to use to accomplish the scientific goals?

We propose to use the BatchQC tool to evaluate the data for batch effects. BatchQC provides various visualization tools for evaluating batch effects. We will use these and prepare a summary based on our findings. ComBat is a widely used method that uses empirical Bayes methods for correcting batch effects. Currently, we are unsure whether batch is known for these samples. If batch is known, we will adjust for it using ComBat. If not, we will use SVA to identify surrogate variables and plot those against variables that may have a confounding effect.

What input data are required for this analysis?

RNA-Seq data. We will start with gene-summarized values but may also work with transcript-level data.

How long do you expect is needed to complete the analysis? Will it be a multi-step analysis?

10-20 hours. It will require at least two steps (examining the data using MultiQC and applying ComBat/sva).

Who will complete the analysis (please add a GitHub handle here if relevant)?

@nathanmella and @srp33

What relevant scientific literature relates to this analysis?

https://www.ncbi.nlm.nih.gov/pubmed/20838408 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5167063/ https://www.ncbi.nlm.nih.gov/pubmed/22257669

As shown in our previous work, we found identified the extent to which batches and histologies are confounded. We plan to finish this analysis and complete the following:

Transfer R scripts to an R Notebook file that is documented with the corresponding graphs
Clean up code with all functions living in one file in the directory
Update Docker File
Submit a pull request

jaclyn-taroni commented 3 years ago

Linking the closed pull request relevant to this analysis as well: https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/628

jharenza commented 3 years ago

Hi @natemella! @komalsrathi is familiar with some of these analyses through our work at CHOP, so I would love for her to weigh in here with her experience!

natemella commented 3 years ago

Thanks, @jharenza for making the introduction. @komalsrathi, I'm excited to collaborate with you.

komalsrathi commented 3 years ago

Hi @natemella good to meet you virtually.

So here's what we have done. Our goal was to combine 3 different RNA-seq datasets for the purpose of differential gene expression analysis:

Tumor samples from various studies like CBTN, CBTTC, PNOC003, PNOC008 (mix of stranded, poly-A and RNA exome)
GTEx Brain normals (poly-A)
Brain normals from TGEN (poly-A)

GTEx and TGEN consist only of poly-A samples but we have stranded, poly-A and a third RNA enrichment method called TruSeq RNA Exome for the tumor samples. So, prior to combining the datasets, we wanted to correct not only for the different studies that we were combining but also wanted to correct for the different library types. For our purpose, we defined batch as a combination of study identifier i.e. GTEx, TGEN or OpenPBTA and library type i.e. stranded or poly-A.

Here, is the breakdown of the batches we have and as mentioned above, a batch is a combination of study-identifier and library-type separated by _:

> plyr::count(dat$batch)
              batch freq
1     CBTN_stranded    1
2      CBTTC_poly-A   20
3    CBTTC_stranded  112
4       GTEx_poly-A 1152
5    PNOC003_poly-A   32
6 PNOC008_rna_exome    9
7  PNOC008_stranded   12
8       TGEN_poly-A    5

We wanted to evaluate the following methods for batch-correction from the R package sva:

Combat (developed for microarray but can be adapted to RNA-seq) and
Combat_Seq (newer and specifically for RNA-Seq datasets)

In our case, we could not evaluate ComBat-seq because it does not support 1 sample per batch and only takes counts as input. Because we needed to work with TPM values (for downstream analysis) and we have one sample corresponding to the CBTN_stranded batch, we decided to use ComBat as it works fine using the mean only version in the presence of 1 sample per batch.

Here is my repo corresponding to the batch correction code: https://github.com/komalsrathi/rnaseq-batch-correction/tree/rnaseq-batch-correct/analyses/rnaseq-batch-correct/.

There are a couple of scripts and I'll quickly elaborate:

Note: we are working with TPM values

# script to merge RSEM files from all input samples into 1 matrix of gene_id vs sample matrix with TPM values (if applicable)
# this is done separately for each study
00-merge-rsem.R 

# script to take the gene_id vs sample matrix and convert it into a matrix with unique gene symbols and sample matrix containing TPM values
# this is done separately for each study
01-collapse-matrices.R  

# script to join multiple datasets on common gene symbols into one master expression matrix
02-combine-matrices.R   

# script to create clinical file for each dataset to combine 
# the goal of this script is to create a clinical file for each study that contains just two columns: `identifier` matching the samples in the expression matrix and `batch` created by combining study identifier and library type
02-create-clin.R        

# script to combine clinical files created above into one master clinical file
03-combine-clin.R       

# script to batch correct the master TPM expression matrix using the batch information in the master clinical file
# we input log-normalize TPM values i.e. `log2(TPM + 1)` to the ComBat function
# because the batch corrected values are also log-normalized but we wanted to use the matrix to do other normalizations like z-score, we back transform the matrix essentially giving us batch-corrected non-normalized TPM values.
04-batch-correct.R      

# script to create t-SNE and density plots
# the goal here was to use the corrected and uncorrected TPM expression matrices in order to visualize distribution and clustering of house keeping genes
05-qc-plots.R

Code snippet for back transforming batch corrected values: https://github.com/komalsrathi/rnaseq-batch-correction/tree/rnaseq-batch-correct/analyses/rnaseq-batch-correct/code/04-batch-correct.R#L69-L71

Density plot of housekeeping genes: https://github.com/komalsrathi/rnaseq-batch-correction/tree/rnaseq-batch-correct/analyses/rnaseq-batch-correct/plots/pbta-hgat-dx-prog-pm-tgen-gtex-tpm-density.pdf

t-SNE clustering of the entire matrix as well as housekeeping genes: https://github.com/komalsrathi/rnaseq-batch-correction/tree/rnaseq-batch-correct/analyses/rnaseq-batch-correct/plots/pbta-hgat-dx-prog-pm-tgen-gtex-tpm-tsne.pdf

I have tried to elaborate as much as possible, would appreciate your input and please feel free to ask any questions if not clear.

cc: @jharenza @jaclyn-taroni

jaclyn-taroni commented 3 years ago

@komalsrathi I keep getting 404s for those links - is the repository you are linking to private?

komalsrathi commented 3 years ago

@komalsrathi I keep getting 404s for those links - is the repository you are linking to private?

Hi @jaclyn-taroni I am fixing it, didn't realize it is a fork of a private repo so not accessible.

komalsrathi commented 3 years ago

@jaclyn-taroni @jharenza I have just created a new public repo (updated my comment above). Please let me know if you have any issues.

jharenza commented 3 years ago

thanks @komalsrathi !

srp33 commented 3 years ago

Thanks to all of you for you comments! As you may know, Nathan (a student in my lab) worked on an analysis of batch adjustment early last year. It looks like @komalsrathi has also done some work on adjusting for batch.

@jaclyn-taroni or @jharenza I'm unsure of the current state of the paper and whether our previous analyses were useful or whether there's anything we could add to make it useful? We would also be happy to work with @komalsrathi if there's anything beneficial that we could add to her analyses. Or if not, that's fine, too. Please let me know your thoughts. (Sorry that I'm not as familiar as I should be with the process you are using.) Thanks!

jaclyn-taroni commented 3 years ago

Hi @srp33, thanks for your comment!

I will summarize the state of the analyses within this project. I am less familiar with @komalsrathi's analysis, as https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/919#issuecomment-766826554 is the first time it has been introduced to me.

The short(ish) version: The results of batch correction we've seen so far from your group have not been included in the repository or project because they were not merged into the code base here. The lack of batch correct so far has resulted in a focus mostly on the larger, stranded dataset in current versions of the "overview" type display items (figures/pngs/transcriptomic-overview.png). I'd welcome the inclusion of effective batch correction here, because I think might allow for more possibilities to talk about the poly-A dataset in our descriptive analyses. I have no immediate use case in mind for differential expression analysis between normal samples from GTEx and TGEN for this particular project and I do not believe the PNOC0008 data is included here, which seem to be key features of what is described in https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/919#issuecomment-766826554.

As a rule, we will not use any analyses or output files in this project that do not make it through the analytical code review process.

The longer version, with links to earlier discussion for context:

There was an initial pull request to this project, #628, opened in March 2020 that did not get merged. Because that pull request was open for a few months, we closed it in August 2020 when the Docker image got overhauled (altering Docker was one of the outstanding issues) https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/628#issuecomment-674993072.

There were a few outstanding question on the pull request related to the utility of the batch correction, quoting @cgreene from https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/628#issuecomment-598688018 here:

You've batch effect corrected for different datasets, but we suspect those may be confounded with disease type. this means your attempts to remove unwanted variability might actually induce other confounded variability. If you wanted to check this, you'd pick one or a few dominant histologies from those centers, compare them, and at least see that:

Those histologies now overlap, or at least overlap better than they did before.

The histologies that are not those within the center(s) in question retain roughly the same variability, looking within center, that they did before the data were merged.

Without doing this analysis, I don't think there's any way to know if this correction is helping or hurting.

There were some results posted in response to @cgreene's comments here: https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/448#issuecomment-611589858

Follow-up suggestion for how to move forward with getting the code into the repository: https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/448#issuecomment-612998140

Outcome from a meeting on April 21, 2020: https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/448#issuecomment-617364339 - those action items are repeated in this issue 👍

I would welcome pull requests for batch correction if they are coming soon! 😄

Our pull request model works best if the code additions in any one pull request are pretty limited in scope so they can be reviewed in a timely manner. (@komalsrathi may be a great candidate for a scientific review!) It is also helpful if they do not hang out for too long without updates & re-review because they can become out of date with what is in the master branch (e.g., earlier issue with the Docker image).

Please let me know if you have questions - happy to chat via another medium such as our Cancer Data Science Slack!

srp33 commented 3 years ago

Just wanted to let you know that @natemella is actively working on this, and we'll have something for you as soon as we can.

natemella commented 3 years ago

There are two methods for doing the batch correction: combat() and combat_seq(). Combat_seq() showed mixed results.

Here's a before and after where combat-seq worked

Here's a before and after where combat-seq didn't work

The second example is fixed through combat

komalsrathi commented 3 years ago

Hi @natemella are you using TPM input for both Combat and ComBat_seq? I think the latter is only meant to take raw counts as input (from what I have read in the docs and the various discussions around using FPKM/TPM as input for batch correction).

natemella commented 3 years ago

Yes, I'm using TPM. Your explanation makes a lot more sense as to why ComBat_seq hasn't worked well. Thanks for reaching out.

AlexsLemonade / OpenPBTA-analysis