RNAseq tumor only analysis issue

What data file(s) does this issue pertain to?

gene-counts-rsem-expected_count-collapsed.rds

What release are you using?

v11

Put your question or report your issue here.

Potential issue with inflated type I error when performing tumor-only differential gene expression analyses (though please verify whether DESeq2 design is correct).

When attempting to complete the batch correction module using RUVseq analysis on TARGET Neuroblastoma data (mycn amplified vs non-amplified) as a use case (comparing standard DGE to batch-corrected DGE), all genes were found to be differentially expressed by DESeq2:

tabular results of p-values here: https://github.com/PediatricOpenTargets/OpenPedCan-analysis/pull/168#issuecomment-1123808276
distribution of gene ratios does not indicate a systematic difference between mycn amplified vs non-amplified tumors: https://github.com/PediatricOpenTargets/OpenPedCan-analysis/pull/168#issuecomment-1152545558
the issue does not seem unique to TARGET data, as PBTA comparison of HGG WT vs DMG H3 K28M tumors yields a similar result using both DESeq2 (nearly all genes differentially expressed) and EdgeR (code below) on v11 data. Analysis that was proposed here

This appears to be due to the fact that DESeq2 and EdgeR rely on fitting to a negative binomial distribution, and many gene features appear to violate this assumption (QQ plot below):

Potential solution: evaluate non-parametric DGE analysis methods that do not rely on gene-level distributional assumptions (e.g. NOISeq bioconductor package).

Code for testing HGG WT vs DMG H3K28M and generating qq plot:

# script to use HGG WT vs DMG H3K28M PBTA data as a use case for RUVSeq batch correction

# load libraries
suppressPackageStartupMessages({
  library(optparse)
  library(tidyverse)
  library(edgeR)
  library(RUVSeq)
  library(EDASeq)
  library(uwot)
})

# source functions
source('util/umap_plot.R')
source('util/edaseq_plot.R')
source('util/deseq2_pvals_histogram.R') # DESeq2 pval histograms
source('util/box_plots.R') # boxplots for samples
source('util/ruvg_test.R') # function to run RUVg

# define directories
## data dir

root_dir = rprojroot::find_root(rprojroot::has_dir('.git'))
data_dir = file.path(root_dir, 'data', 'v11')

## output dirs

umap_output_dir <- file.path('output', 'pbta_hgg_test', 'umap')
dge_output_dir <- file.path('output', 'pbta_hgg_test', 'dge')
plot_dir <- file.path('output', 'pbta_hgg_test', 'plots')

dir.create(dge_output_dir, showWarnings = F, recursive = T)
dir.create(umap_output_dir, showWarnings = F, recursive = T)
dir.create(plot_dir, showWarnings = F, recursive = T)

# histology file
hist_file <- read.delim(file.path(data_dir, 'histologies.tsv'))
hist.hgg <- hist_file %>%
  filter(experimental_strategy == "RNA-Seq",
         cohort %in% "PBTA", 
         harmonized_diagnosis %in% c('High-grade glioma/astrocytoma, H3 wildtype', 'Diffuse midline glioma, H3 K28-mutant'))

# HRT atlas genes
hk_genes_hrt <- readRDS('input/hk_genes_normals.rds')

# read expression data
counts <- readRDS(file.path(data_dir, 'gene-counts-rsem-expected_count-collapsed.rds'))
counts.hgg <- counts %>%
  dplyr::select(hist.hgg$Kids_First_Biospecimen_ID)

# filter for zero count and protein coding genes from gencode v27
counts.hgg <- counts.hgg[rowSums(counts.hgg) > 0,] # remove all genes with zero counts across all samples

gencode_gtf <- rtracklayer::import(con = file.path(root_dir, 'data', 'gencode.v27.primary_assembly.annotation.gtf.gz'))
gencode_gtf <- as.data.frame(gencode_gtf)
gencode_pc <- gencode_gtf %>%
  dplyr::select(gene_id, gene_name, gene_type) %>%
  filter(gene_type == "protein_coding") %>%
  unique()

counts.hgg <- counts.hgg %>%
  rownames_to_column("gene") %>%
  filter(gene %in% gencode_pc$gene_name) %>%
  column_to_rownames("gene")

# diagnostic umap of HGG vs DMG
## high variance genes above 90% quantile (3666 x 19995)

row_variances <- matrixStats::rowVars(as.matrix(counts.hgg))
high_var_exp <- counts.hgg[which(row_variances > quantile(row_variances, 0.9)), ]

## housekeeping gene 

hk_genes <- counts.hgg.log[rownames(counts.hgg.log) %in% hk_genes_hrt,]

# test RUVseq vs no RUVseq DESeq2 models
#bs_id <- gsub('TARGET-[0-9]{2}-', '', selected_htl_df$Kids_First_Biospecimen_ID) # shorten bs ids for TARGET samples

## no RUVseq DGE
### build design matrix
harmonized_diagnosis <- factor(as.character(hist.hgg$harmonized_diagnosis))
design <- model.matrix(~ 0 + harmonized_diagnosis)
bs_id <- hist.hgg$Kids_First_Biospecimen_ID
RNA_library = hist.hgg$RNA_library

### filter lowly expressed genes and test by EdgeR
counts_object <- edgeR::DGEList(counts = counts.hgg, group = harmonized_diagnosis)
counts_object <- calcNormFactors(counts_object)
keep <- edgeR::filterByExpr(counts_object, group = harmonized_diagnosis)
counts_object_filtered <- counts_object[keep, , keep.lib.sizes = F]
counts_object_filtered <- estimateDisp(counts_object_filtered, design)
fit <- glmQLFit(counts_object_filtered, design)
qlf <- glmQLFTest(fit, contrast=c(1,0))
res <- topTags(qlf, n = Inf)

### build EDA expressionset for DESeq2 analysis
seq_expr_set <- EDASeq::newSeqExpressionSet(counts = round(counts_object$counts), phenoData = data.frame(bs_id, RNA_library, harmonized_diagnosis, row.names = colnames(counts.hgg)))

# from https://support.bioconductor.org/p/95805/#95808
# The "betweenLaneNormalization" is just a poorly named function that perform between-sample normalization, 
# independently of the fact that they are organized in lanes, plates, etc. 
# It is used in RUVSeq just to make sure that the factors of unwanted variation don't capture the difference in library size that should be captured by the size factors. 
# normalize the data using upper-quartile (UQ) normalization
seq_expr_set <- EDASeq::betweenLaneNormalization(seq_expr_set, which = "upper")
umap_p_norm <- edaseq_plot(object = seq_expr_set, title = "After UQ normalization", type = "UMAP", color_var = "harmonized_diagnosis", shape_var = "RNA_library")

p <- ggpubr::ggarrange(umap_p_norm, common.legend = T, legend = "bottom")
fname <- file.path(umap_output_dir, 'umap_hgg_uq_norm.pdf')
ggsave(filename = fname, plot = p, width = 6, height = 6, device = "pdf", bg = "white")

# 1. DESeq2::DESeq performs a default analysis through the steps:
# - estimation of size factors: estimateSizeFactors
# - estimation of dispersion: estimateDispersions
# - Negative Binomial GLM fitting and Wald statistics: nbinomWaldTest
dds <- DESeq2::DESeqDataSetFromMatrix(countData = round(counts(seq_expr_set)), colData =  data.frame(bs_id, harmonized_diagnosis), design = design)
dds <- DESeq2::DESeq(dds)
dge_output <- DESeq2::results(dds, cooksCutoff = FALSE, pAdjustMethod = 'BH')
dge_output <- dge_output %>% 
  as.data.frame() %>% 
  rownames_to_column('gene') %>%
  arrange(padj) 

# Test for goodness of fit to negative binomial distribution for a sample of high variance genes defined above

test = tweeDEseq::gofTest(na.omit(round(high_var_exp)))
qq <- qqchisq(test, main="Chi2 Q-Q Plot", ylim = c(0, 60))

@taylordm @chinwallaa @jharenza @afarrel

To update - I had overlooked that the design had been specified as 0+harmonized_diagnosis, which eliminates a reference group comparison, specifies an intercept of 0, and returns results for whether gene expression of any group is significantly different from 0. This design requires a secondary contrast test explicitly specifying the two harmonized diagnoses to get a list of which genes are different between the two groups. Specifying the design as ~harmonized_diagnosis will use the first factor (e.g. H3 WT) as a reference group and the coefficient + p-value for H3K28 group will reflect a test for a significant difference in mean from the reference. Re-running the latter design returns ~5000 differentially expressed genes.

When trying to resolve this issue, I had also resorted to testing DESeq2's 'best practices' with RSEM data by obtaining individual sample transcript/isoform-level counts from @zhangb1, importing with tximport, and converting to a deseq2 object for deseq2 to calculate a gene offset (which accounts for differences in average transcript length across samples). Using that approach, I obtained approximately 200 differentially expressed genes specifying the same design (code below).

The caveat with the latter approach is that tximport requires individual sample files. Whether we want to include that as a data deliverable for OpenPedCan is something to consider if we feel that the tximport approach is providing a better sensitivity/specificity balance than the standard gene-level analysis.

With that resolved, I will go on to test RUVseq for this use case using the RSEM gene-level matrix as this is our current standard for gene expression data. I'll leave this ticket open through the end of the week if there are any comments then close.


# load libraries
suppressPackageStartupMessages({
  library(optparse)
  library(tidyverse)
  library(edgeR)
  library(RUVSeq)
  library(EDASeq)
  library(uwot)
  library(DESeq2)
  library(tximport)
  library(tximportData)
})

# source functions
source('util/umap_plot.R')
source('util/edaseq_plot.R')
source('util/deseq2_pvals_histogram.R') # DESeq2 pval histograms
source('util/box_plots.R') # boxplots for samples
source('util/ruvg_test.R') # function to run RUVg

# define directories
## data dir

dir <- system.file("extdata", package = "tximportData")
root_dir = rprojroot::find_root(rprojroot::has_dir('.git'))
data_dir = file.path(root_dir, 'data', 'v11')

## output dirs

umap_output_dir <- file.path('output', 'pbta_hgg_test', 'umap')
dge_output_dir <- file.path('output', 'pbta_hgg_test', 'dge')
plot_dir <- file.path('output', 'pbta_hgg_test', 'plots')

dir.create(dge_output_dir, showWarnings = F, recursive = T)
dir.create(umap_output_dir, showWarnings = F, recursive = T)
dir.create(plot_dir, showWarnings = F, recursive = T)

# histology file
hist_file <- read.delim(file.path(data_dir, 'histologies.tsv'))
hist.hgg <- hist_file %>%
  filter(experimental_strategy == "RNA-Seq",
         cohort %in% "PBTA", 
         harmonized_diagnosis %in% c('High-grade glioma/astrocytoma, H3 wildtype', 'Diffuse midline glioma, H3 K28-mutant'))

# gencode reference

gencode_gtf <- rtracklayer::import(con = file.path(root_dir, 'data', 'gencode.v27.primary_assembly.annotation.gtf.gz'))
gencode_gtf <- as.data.frame(gencode_gtf)
gencode_pc <- gencode_gtf %>%
  dplyr::select(gene_id, gene_name, gene_type) %>%
  filter(gene_type == "protein_coding") %>%
  unique()
# read transcript count data

file.metadata <- readr::read_tsv('~/Documents/temp_hgg_dmg_openpedcan/manifest_20220801_120351.tsv')
files <- file.metadata$name
files <- file.path('~/Documents/temp_hgg_dmg_openpedcan', files)
names(files) <- file.metadata$`Kids First Biospecimen ID`

# obtain transcript to gene mappings
tx2gene <- read_csv(file.path(dir, "tx2gene.gencode.v27.csv"))
txi.rsem <- tximport(files, type = "rsem", txIn = TRUE, txOut = TRUE)

# remove _geneName and retain only transcript IDs for tximport to map transcripts to genes
rownames(txi.rsem[['counts']]) = gsub('_.+', '', rownames(txi.rsem[['counts']]))
rownames(txi.rsem[['length']]) = gsub('_.+', '', rownames(txi.rsem[['length']]))
rownames(txi.rsem[['abundance']]) = gsub('_.+', '', rownames(txi.rsem[['abundance']]))
txi.rsem <- tximport::summarizeToGene(txi.rsem, tx2gene = tx2gene)

# build DESeq2 dataset from tximport and specify design. 

harmonized_diagnosis <- factor(as.character(hist.hgg$harmonized_diagnosis))
design <- model.matrix(~harmonized_diagnosis)
bs_id <- hist.hgg$Kids_First_Biospecimen_ID
RNA_library = hist.hgg$RNA_library
dds <- DESeqDataSetFromTximport(txi.rsem, hist.hgg, design = design)

# 1. DESeq2::DESeq performs a default analysis through the steps:
# - estimation of size factors: estimateSizeFactors
# - estimation of dispersion: estimateDispersions
# - Negative Binomial GLM fitting and Wald statistics: nbinomWaldTest

dds <- DESeq2::DESeq(dds)
dge_output <- DESeq2::results(dds, cooksCutoff = FALSE, pAdjustMethod = 'BH')
dge_output <- dge_output %>% 
  as.data.frame() %>% 
  rownames_to_column('gene') %>%
  arrange(padj) 

ind = which(dge_output$padj < 0.05)
dge_output.f = dge_output[ind,]
dge_output.f = dge_output.f[which(dge_output.f$gene %in% gencode_pc$gene_id),]
dge_output.f = merge(dge_output.f, gencode_pc, by.x = 'gene', by.y = 'gene_id')

d3b-center / ticket-tracker-OPC