LieberInstitute / zandiHyde_bipolar_rnaseq

RNA-seq analysis for psychENCODE R01
6 stars 1 forks source link

Variability in gene expression between sACC and AMYG #13

Closed lcolladotor closed 3 years ago

lcolladotor commented 3 years ago
  1. Regarding the difference between the number of differentially expressed features in sACC and amygdala, are expression levels (of any/all features) more variable in the sACC than the amygdala, and could this higher variance drive the greater discovery in the sACC?

Here I think that the simplest plot to make is a scatter plot (like in #8) between the rowSds(RPKM) computed for each region. genefilter and matrixStats implement that function.

lcolladotor commented 3 years ago

https://rdrr.io/cran/matrixStats/man/rowSds.html

This is a bit more complicated since it we'll have to do for every feature level like in #8. So RPKMs for genes and exons, RP80M (or was it RP10M?) for exon-exon junctions and TPMs for transcripts.

andrewejaffe commented 3 years ago

It's probably worth running levene test or another test of differential variation comparing an equal number of controls from both brain regions. Maybe on log rpkm scale and then on the cleaningY output of full model (sans Dx)?

On Thu, Apr 8, 2021, 2:34 PM Leonardo Collado-Torres < @.***> wrote:

https://rdrr.io/cran/matrixStats/man/rowSds.html

This is a bit more complicated since it we'll have to do for every feature level like in #8 https://github.com/LieberInstitute/zandiHyde_bipolar_rnaseq/issues/8. So RPKMs for genes and exons, RP80M (or was it RP10M?) for exon-exon junctions and TPMs for transcripts.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/LieberInstitute/zandiHyde_bipolar_rnaseq/issues/13#issuecomment-816048295, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEYY2ENSJHXSKKVQ4P3DBLTHXZLLANCNFSM42TMKJSQ .

lcolladotor commented 3 years ago

I like the idea of regressing out all the variability from the terms (covariates) we adjust for in the models, keeping only the Dx.

I didn't remember the levene test https://en.wikipedia.org/wiki/Levene%27s_test, so that's a good idea =). Looks like car::leveneTest() can do the job http://www.sthda.com/english/wiki/compare-multiple-sample-variances-in-r.

lcolladotor commented 3 years ago

So, for every gene, extract expr from cleaningY(log2(rpkm + 1), mod = joint_model, P = 3) where we keep the Dx and Region effects.

then use:

library(car)
leveneTest(expr ~ Region * Dx, data = data_for_one_gene)

## Extract the F-value and P-value from the leveneTest() output

also keep the results from the Region and Dx only runs

leveneTest(expr ~ Region, data = data_for_one_gene)
leveneTest(expr ~ Dx, data = data_for_one_gene)
lcolladotor commented 3 years ago

FYI I just finished editing my previous comment

andrewejaffe commented 3 years ago

/users/ajaffe/Lieber/lieber_functions_aj.R #1938 leveneFast()

On Tue, Apr 20, 2021 at 2:11 PM Leonardo Collado-Torres < @.***> wrote:

FYI I just finished editing my previous comment

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/LieberInstitute/zandiHyde_bipolar_rnaseq/issues/13#issuecomment-823493615, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEYY2ALFPZ7Z3WNSQVLR2DTJW7VHANCNFSM42TMKJSQ .

lcolladotor commented 3 years ago

OHHH, thanks!

leveneFast <- function(dat, groups) {
    require(matrixStats)
    N = length(groups)
    k = length(unique(groups))

    gIndexes = split(seq(along=groups),groups)
    gLens = as.numeric(sapply(gIndexes,length))

    zij = dat
    for(i in seq(along=gIndexes)) {
        ind = gIndexes[[i]]
        zij[,ind] = abs(dat[,ind] - rowMedians(dat[,ind]))
    }

    zdotdot = rowMeans(zij)
    zidot = sapply(gIndexes, function(x) rowMeans(zij[,x]))

    num = rowSums(gLens*(zidot-zdotdot)^2)

    out = zidot
    for(i in seq(along=gIndexes)) {
        ind = gIndexes[[i]]
        out[,i] = rowSums((zij[,ind] - zidot[,i])^2)
    }
    denom = rowSums(out)

    W = ((N-k)/(k-1))*(num/denom)
    pval = pf(W, df1 = k-1, df2 = N-k, lower.tail = FALSE)

    return(data.frame(cbind(W,pval)))
}

Louise, I see that groups would need to be a factor of length equal to the number of columns of dat (so 1 value per sample). So for Dx * Region it can be done with groups = factor(paste0(Dx, "_", Region)) or something like that. dat would be the output from cleaningY(). You'll need to have matrixStats installed too.