Closed by aadamk 2 years ago
Is there a way to see if you observe this using raw reads/counts instead of expected counts?
Hi @afarrel - I'll check with the bix dev team regarding whether a matrix of raw counts is available.
The vignette also mentions that RSEM transcript abundances can be used, provided that offsets for gene-level abundances are calculated using tximport. We could go that route.
To update - I had overlooked that the design had been specified as ~0+harmonized_diagnosis, which eliminates the reference group comparison, fixes the intercept at 0, and returns results for whether the expression of a group is significantly different from 0. This design requires a secondary contrast test explicitly specifying the two harmonized diagnoses to get a list of the genes that differ between the two groups. Specifying the design as ~harmonized_diagnosis will use the first factor level (e.g. H3 WT) as the reference group, and the coefficient and p-value for the H3K28 group will reflect a test for a significant difference in mean from the reference. Re-running with the latter design returns ~5000 differentially expressed genes.
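To make the difference between the two designs concrete, here is a minimal sketch in base R. It uses a toy response and `lm()` as a stand-in for DESeq2's negative binomial GLM, so it only illustrates how the coefficients are parameterized, not the actual test. The group labels `H3_WT` and `H3_K28` and the values in `toy_expression` are made up for illustration. Note that `factor()` orders levels alphabetically by default, so the levels are set explicitly here to make `H3_WT` the reference.

```r
# Toy group labels (hypothetical); levels set explicitly so H3_WT is the reference.
harmonized_diagnosis <- factor(
  c("H3_WT", "H3_WT", "H3_WT", "H3_K28", "H3_K28", "H3_K28"),
  levels = c("H3_WT", "H3_K28")
)
# Toy response for a single gene (made-up values): group means are 2 and 6.
toy_expression <- c(1, 2, 3, 5, 6, 7)

# No-intercept design: one coefficient per group, each equal to that group's mean.
fit_no_intercept <- lm(toy_expression ~ 0 + harmonized_diagnosis)
coef_no_intercept <- coef(fit_no_intercept)

# A between-group comparison then needs an explicit contrast (H3_K28 minus H3_WT).
contrast_k28_vs_wt <- c(-1, 1)
diff_from_contrast <- sum(contrast_k28_vs_wt * coef_no_intercept)

# Reference-level design: the intercept is the H3_WT mean and the second
# coefficient is already the H3_K28 minus H3_WT difference.
fit_reference <- lm(toy_expression ~ harmonized_diagnosis)
diff_from_reference <- unname(coef(fit_reference)[2])

print(coef_no_intercept)
print(c(contrast = diff_from_contrast, reference = diff_from_reference))
```

Both routes give the same group difference. In DESeq2 the analogous step for the no-intercept design would be passing a contrast to `results()`, whereas with the reference-level design the default result already corresponds to the group comparison.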
When trying to resolve this issue, I had also resorted to testing DESeq2's 'best practices' with RSEM data by obtaining individual sample transcript/isoform-level counts from @zhangb1, importing them with tximport, and converting to a DESeq2 object so that DESeq2 calculates a gene-level offset (which accounts for differences in average transcript length across samples). Using that approach, I obtained approximately 200 differentially expressed genes with the same design (code below).
The caveat with the latter approach is that tximport requires individual sample files. Whether we want to include that as a data deliverable for OpenPedCan is something to consider if we feel that the tximport approach is providing a better sensitivity/specificity balance than the standard gene-level analysis.
With that resolved, I will go on to test RUVSeq for this use case using the RSEM gene-level matrix, as this is our current standard for gene expression data. I'll leave this ticket open through the end of the week in case there are any comments, then close it.
# load libraries
suppressPackageStartupMessages({
  library(optparse)
  library(tidyverse)
  library(edgeR)
  library(RUVSeq)
  library(EDASeq)
  library(uwot)
  library(DESeq2)
  library(tximport)
  library(tximportData)
})
# source functions
source('util/umap_plot.R')
source('util/edaseq_plot.R')
source('util/deseq2_pvals_histogram.R') # DESeq2 pval histograms
source('util/box_plots.R') # boxplots for samples
source('util/ruvg_test.R') # function to run RUVg
# define directories
## data dir
dir <- system.file("extdata", package = "tximportData")
root_dir = rprojroot::find_root(rprojroot::has_dir('.git'))
data_dir = file.path(root_dir, 'data', 'v11')
## output dirs
umap_output_dir <- file.path('output', 'pbta_hgg_test', 'umap')
dge_output_dir <- file.path('output', 'pbta_hgg_test', 'dge')
plot_dir <- file.path('output', 'pbta_hgg_test', 'plots')
dir.create(dge_output_dir, showWarnings = F, recursive = T)
dir.create(umap_output_dir, showWarnings = F, recursive = T)
dir.create(plot_dir, showWarnings = F, recursive = T)
# histology file
hist_file <- read.delim(file.path(data_dir, 'histologies.tsv'))
hist.hgg <- hist_file %>%
  filter(experimental_strategy == "RNA-Seq",
         cohort %in% "PBTA",
         harmonized_diagnosis %in% c('High-grade glioma/astrocytoma, H3 wildtype', 'Diffuse midline glioma, H3 K28-mutant'))
# gencode reference
gencode_gtf <- rtracklayer::import(con = file.path(root_dir, 'data', 'gencode.v27.primary_assembly.annotation.gtf.gz'))
gencode_gtf <- as.data.frame(gencode_gtf)
gencode_pc <- gencode_gtf %>%
  dplyr::select(gene_id, gene_name, gene_type) %>%
  filter(gene_type == "protein_coding") %>%
  unique()
# read transcript count data
file.metadata <- readr::read_tsv('~/Documents/temp_hgg_dmg_openpedcan/manifest_20220801_120351.tsv')
files <- file.metadata$name
files <- file.path('~/Documents/temp_hgg_dmg_openpedcan', files)
names(files) <- file.metadata$`Kids First Biospecimen ID`
# obtain transcript to gene mappings
tx2gene <- read_csv(file.path(dir, "tx2gene.gencode.v27.csv"))
txi.rsem <- tximport(files, type = "rsem", txIn = TRUE, txOut = TRUE)
# remove _geneName and retain only transcript IDs for tximport to map transcripts to genes
rownames(txi.rsem[['counts']]) = gsub('_.+', '', rownames(txi.rsem[['counts']]))
rownames(txi.rsem[['length']]) = gsub('_.+', '', rownames(txi.rsem[['length']]))
rownames(txi.rsem[['abundance']]) = gsub('_.+', '', rownames(txi.rsem[['abundance']]))
txi.rsem <- tximport::summarizeToGene(txi.rsem, tx2gene = tx2gene)
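# build DESeq2 dataset from tximport and specify design.
# align histology rows to the tximport sample columns so that colData and counts are in the same order
sample_ids <- colnames(txi.rsem[['counts']])
stopifnot(all(sample_ids %in% hist.hgg$Kids_First_Biospecimen_ID))
hist.hgg <- hist.hgg[match(sample_ids, hist.hgg$Kids_First_Biospecimen_ID), ]
harmonized_diagnosis <- factor(as.character(hist.hgg$harmonized_diagnosis))
design <- model.matrix(~harmonized_diagnosis)
bs_id <- hist.hgg$Kids_First_Biospecimen_ID
RNA_library = hist.hgg$RNA_library
dds <- DESeqDataSetFromTximport(txi.rsem, hist.hgg, design = design)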
# 1. DESeq2::DESeq performs a default analysis through the steps:
# - estimation of size factors: estimateSizeFactors
# - estimation of dispersion: estimateDispersions
# - Negative Binomial GLM fitting and Wald statistics: nbinomWaldTest
dds <- DESeq2::DESeq(dds)
dge_output <- DESeq2::results(dds, cooksCutoff = FALSE, pAdjustMethod = 'BH')
dge_output <- dge_output %>%
  as.data.frame() %>%
  rownames_to_column('gene') %>%
  arrange(padj)
ind = which(dge_output$padj < 0.05)
dge_output.f = dge_output[ind,]
dge_output.f = dge_output.f[which(dge_output.f$gene %in% gencode_pc$gene_id),]
dge_output.f = merge(dge_output.f, gencode_pc, by.x = 'gene', by.y = 'gene_id')
What data file(s) does this issue pertain to?
gene-counts-rsem-expected_count-collapsed.rds
What release are you using?
v11
Put your question or report your issue here.
Potential issue with inflated type I error when performing tumor-only differential gene expression analyses (though please verify whether DESeq2 design is correct).
When attempting to complete the batch correction module using RUVSeq on TARGET Neuroblastoma data (MYCN-amplified vs non-amplified) as a use case (comparing standard DGE to batch-corrected DGE), all genes were found to be differentially expressed by DESeq2:
This appears to be because DESeq2 and edgeR rely on fitting a negative binomial distribution, and many gene features appear to violate this assumption (QQ plot below):
Potential solution: evaluate non-parametric DGE analysis methods that do not rely on gene-level distributional assumptions (e.g. the NOISeq Bioconductor package).
Code for testing HGG WT vs DMG H3K28M and generating qq plot:
@taylordm @chinwallaa @jharenza @afarrel