kgaonkar6 commented 3 years ago

What are the scientific goals of the analysis?

Determine stably expressed transcripts between polyA and stranded samples.
Check if the DGE analysis between polya and stranded data follow a null-p-value distribution.

What methods do you plan to use to accomplish the scientific goals?

Perform DGE analyses in polya and stranded datasets:

Expected output:

TSV of stably expressed transcripts
Histograms showing the distributions of the p-values for the lrt and exactTests.
Number of differentially expressed genes reported for the lrt and exactTests

What input data are required for this analysis?

Modified from https://github.com/d3b-center/bixu-tracker/issues/884 as described by @aadamk (since it seems @logstar doesn't have access to that ticket)

Download files to data folder in your forked repo of PediatricOpenTargets/OpenPBTA-analysis https://s3.amazonaws.com/kf-openaccess-us-east-1-prd-pbta/data/release-v19-20210423/pbta-gene-counts-rsem-expected_count.polya.rds https://s3.amazonaws.com/kf-openaccess-us-east-1-prd-pbta/data/release-v19-20210423/pbta-gene-counts-rsem-expected_count.stranded.rds
Collapse to genename level You can use code from here bash run-collapse-rnaseq.sh -p pbta -l stranded -x counts -q expected_count bash run-collapse-rnaseq.sh -p pbta -l polya -x counts -q expected_count

Filter common genes in polya and stranded data and select the above samples:


polya <- readRDS("<collapse_rnaseq/results folder>/pbta-gene-counts-rsem-expected_count-collapsed.polya.rds") 
stranded <- readRDS("<collapse_rnaseq/results folder>/pbta-gene-counts-rsem-expected_count-collapsed.stranded.rds")

Index | Kids_First_Biospecimen_ID | sample_id | experimental_strategy | RNA_library | cohort

-- | -- | -- | -- | -- | --

1 | BS_HE0WJRW6 | 7316-1455 | RNA-Seq | stranded | CBTN

2 | BS_HWGWYCY7 | 7316-1455 | RNA-Seq | poly-A | CBTN

3 | BS_SHJA4MR0 | 7316-161 | RNA-Seq | stranded | CBTN

4 | BS_X0XXN9BK | 7316-161 | RNA-Seq | poly-A | CBTN

5 | BS_FN07P04C | 7316-255 | RNA-Seq | stranded | CBTN

6 | BS_W4H1D4Y6 | 7316-255 | RNA-Seq | poly-A | CBTN

7 | BS_8QB4S4VA | 7316-536 | RNA-Seq | stranded | CBTN

8 | BS_QKT3TJVK | 7316-536 | RNA-Seq | poly-A | CBTN

9 | BS_7WM3MNZ0 | A16915 | RNA-Seq | poly-A | PNOC003

10 | BS_KABQQA0T | A16915 | RNA-Seq | stranded | PNOC003

11 | BS_68KX6A42 | A18777 | RNA-Seq | poly-A | PNOC003

12 | BS_D7XRFE0R | A18777 | RNA-Seq | stranded | PNOC003

group <- factor(c(rep("stranded",6),rep("polya",6)))

common_gene_id <- intersect(rownames(stranded),rownames(polya))

counts <- cbind(stranded[common_gene_id,c("BS_HE0WJRW6","BS_SHJA4MR0","BS_FN07P04C","BS_8QB4S4VA","BS_KABQQA0T","BS_D7XRFE0R")], polya[common_gene_id,c("BS_HWGWYCY7","BS_X0XXN9BK","BS_W4H1D4Y6","BS_QKT3TJVK","BS_7WM3MNZ0","BS_68KX6A42")])


4. DGE as below modified form of edgeR's differential expression pipeline for the case of no replicates. This will involve the following
Housekeeping gene list can be found  [here](https://housekeeping.unicamp.br/Housekeeping_GenesHuman.csv), derived from the HRT atlas analysis on GTEx datasets. It seems to me the ensembl ids are from a different ENSEMBL version so you can probably can use  the genename only.

Build DGEList

counts_object = DGEList(counts = y, group=group).

Filter by gene expression and normalize by TMM

keep = filterByExpr(counts_object) counts_object <- counts_object[keep,,keep.lib.sizes=FALSE] counts_object <- calcNormFactors(counts_object)

Copy normalized DGEList object (new object counts_object_1) and replace group variable with only one treatment group

counts_object_1=counts_object counts_object_1$samples$group <- 1

Estimate common dispersion from a set of housekeeping genes (linked below).

y0 <- estimateDisp(counts_object_1[housekeeping,], trend="none", tagwise=FALSE) counts_object$common.dispersion <- y0$common.dispersion

Make design matrix

design <- model.matrix(~Groups) fit <- glmFit(counts_object, design) lrt <- glmLRT(fit)


UQ-pgQ2 normalization functions are below. UQ-pgQ2 normalization is described here. Note - this is a per-sample upper-quartile global scaling at 75 percentile followed by a per-gene Q2 normalization:

Functions

X is a matrix of data

Define upper quartile (UQ) normalization method

UQ<-function(X){

uq<-function(y){ quantile(y, 0.75) } X<-X+0.1 upperQ<-apply(X,2,uq) f<-upperQ/mean(upperQ) # calculate normalization factor res<-scale(X,center=FALSE,scale=f) return(res) }

per gene normalization by Median: pgQ2

X: a matrix of data with the multiplication of factor (f) as:

f=100 as a default

uq.pgQ2<-function(X, f){

uq.res<-UQ(X) #  perform UQ normalization per sample

## perform per gene nomralization

m<-apply(uq.res,1, median)

idx<-which(m==0) # avoid 0 median m[idx]=0.1

si<-m/f # calculate normalization factor X1<-scale(t(uq.res),center=FALSE,scale=si) res<-t(X1) return(res)
}



#### How long do you expect is needed to complete the analysis? Will it be a multi-step analysis?

3-5 days

#### Who will complete the analysis (please add a GitHub handle here if relevant)?
@logstar 

#### What relevant scientific literature relates to this analysis?
@jharenza @aadamk

jharenza commented 3 years ago

@logstar I think this is a good first ticket which you can start on when you are able!

logstar commented 3 years ago

@logstar I think this is a good first ticket which you can start on when you are able!

@jharenza thank you. I will create a new module for this analysis.

logstar commented 3 years ago

@kgaonkar6 thank you for the detailed analysis description.

I was wondering if there is any recommended method in OpenPBTA or d3b to identify stably expressed transcripts. I cannot see how the d3b single-sample DGE issue was resolved, because I do not have access to the repository.

I am also planning to use edgeR for the DGE analysis. If you have any other recommended method, I will switch to the recommended one instead.

kgaonkar6 commented 3 years ago

@logstar the methods are updated now. Your question about "stably expressed transcripts" I think @jharenza or @aadamk can provide the logFC and pvalue thresholds.

@aadamk I updated the ticket to add more information for methods you described from https://github.com/d3b-center/bixu-tracker/issues/884 does that looks ok?

kgaonkar6 commented 3 years ago

Also adding @komalsrathi for her input on if there is code used previously . I believe similar analysis was implemented in PNOC008 ?

logstar commented 3 years ago

@logstar the methods are updated now. Your question about "stably expressed transcripts" I think @jharenza or @aadamk can provide the logFC and pvalue thresholds.

@kgaonkar6 thank you for the detailed method descriptions and code. I will use them to create this analysis module.

komalsrathi commented 3 years ago

@logstar

The script that addresses the reference issue is here . This was done for single sample tumor sample vs GTEx Brain samples so you might have to make some minor changes here but overall should work.

cc: @aadamk please confirm if this is what needs to be implemented.

logstar commented 3 years ago

@komalsrathi thank you for the pointers.

logstar commented 3 years ago

@jharenza

I have created a new module rna-seq-protocol-dge for this analysis in my own fork of PediatricOpenTargets/OpenPBTA-analysis, https://github.com/logstar/OpenPBTA-analysis/tree/rna-seq-protocol-dge/analyses/rna-seq-protocol-dge

I have completed the following analysis steps:

Collapse RSEM expected count matrices of poly-A and stranded libraries using the OpenPBTA-analysis/analyses/collapse-rnaseq module.
Run edgeR DGE LRT and exactTest on the 12 selected libraries to compare poly-A and stranded RNA-seq protocols, using the collapsed RSEM expected count matrices and code adapted from https://github.com/d3b-center/OMPARE/blob/master/code/patient_level_analyses/utils/rnaseq_edger_normalizations.R .

Next, I will select stably expressed transcripts in the stranded and poly-A libraries using logFC and p-values.

If you have any questions or would like to add anything, feel free to let me know.

jharenza commented 3 years ago

thanks @logstar - Great! If possible, can you break this up as noted in the multi-step PR model here and start submitting PRs? This will allow us to review smaller chunks of code faster.

logstar commented 3 years ago

thanks @logstar - Great! If possible, can you break this up as noted in the multi-step PR model here and start submitting PRs? This will allow us to review smaller chunks of code faster.

Sure. Thank you for the pointer. I will submit multi-step PRs to the PediatricOpenTargets/OpenPBTA-analysis for this analysis.

aadamk commented 3 years ago

Thank you for linking in the methods @kgaonkar6 and @komalsrathi (just getting to this as 003 + CBTN took priority late last week).

To help weigh in @logstar, the above TMM and UQ-pgQ2 code should be sufficient for benchmarking DGE analyses for matched samples differing by library enrichment method. The latter normalization is expected to have fewer false positives for balanced experimental designs, whereas the former does not depend on balanced designs (https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-6502-7). The only modification that I have would be to add the QL F-test as part of this analysis, as this combines well with UQ-pgQ2 for small sample sizes in terms of control of the FP rate.

One caveat in the above code which I'm sure you've noticed is the assignment of groups, as the above copied code assigns groups suitable for comparison of a single sample to a large cohort (for an N of 1 trial that is currently ongoing). So long as that is adjusted accordingly, the downstream estimation of dispersion + differential expression tests should then be straightforward. So there would be no need to copy the normalized DGEList object as you see above, but you can instead just define the groups from the beginning then proceed with estimation of dispersions and down stream testing.

Also modifying the expected output, which would be: Expected output:

TSV of stably expressed transcripts
Histograms showing the distributions of the p-values for the lrt, exactTests, and QL F Test.
Number of differentially expressed genes reported for the lrt, exactTests, and QL F Test.
Number of differentially expressed genes reported for the lrt, exactTests, and QL F Test.
DEG output txt files showing differentially and non-differentially expressed genes for lrt, exactTests, and QL F Test.

Also linking in the discussion from issue 10 here to keep you up to speed: https://github.com/PediatricOpenTargets/ticket-tracker/issues/10.

Can you also include me in the PRs as a reviewer @logstar?

logstar commented 3 years ago

@aadamk Thank you for the detailed description.

I will add UQ-pgQ2 QL F test in this analysis module and generate the modified list of expected output. I will also remove the unnecessary steps in the estimation of dispersions and down stream testing.

I will include you in the PRs as a reviewer.

jharenza commented 3 years ago

closed with #10 and #11

d3b-center / ticket-tracker-OPC

Proposed Analysis: DGE analysis of matched polya and stranded samples #17

What are the scientific goals of the analysis?

What methods do you plan to use to accomplish the scientific goals?

What input data are required for this analysis?

Index | Kids_First_Biospecimen_ID | sample_id | experimental_strategy | RNA_library | cohort

-- | -- | -- | -- | -- | --

1 | BS_HE0WJRW6 | 7316-1455 | RNA-Seq | stranded | CBTN

2 | BS_HWGWYCY7 | 7316-1455 | RNA-Seq | poly-A | CBTN

3 | BS_SHJA4MR0 | 7316-161 | RNA-Seq | stranded | CBTN

4 | BS_X0XXN9BK | 7316-161 | RNA-Seq | poly-A | CBTN

5 | BS_FN07P04C | 7316-255 | RNA-Seq | stranded | CBTN

6 | BS_W4H1D4Y6 | 7316-255 | RNA-Seq | poly-A | CBTN

7 | BS_8QB4S4VA | 7316-536 | RNA-Seq | stranded | CBTN

8 | BS_QKT3TJVK | 7316-536 | RNA-Seq | poly-A | CBTN

9 | BS_7WM3MNZ0 | A16915 | RNA-Seq | poly-A | PNOC003

10 | BS_KABQQA0T | A16915 | RNA-Seq | stranded | PNOC003

11 | BS_68KX6A42 | A18777 | RNA-Seq | poly-A | PNOC003

12 | BS_D7XRFE0R | A18777 | RNA-Seq | stranded | PNOC003

Build DGEList

Filter by gene expression and normalize by TMM

Copy normalized DGEList object (new object counts_object_1) and replace group variable with only one treatment group

Estimate common dispersion from a set of housekeeping genes (linked below).

Make design matrix

Functions

X is a matrix of data

Define upper quartile (UQ) normalization method

per gene normalization by Median: pgQ2

X: a matrix of data with the multiplication of factor (f) as:

f=100 as a default