gagneurlab / FRASER

FRASER - Find RAre Splicing Events in RNA-seq
MIT License
36 stars 20 forks source link

Questions about optimal q and implementation="AE" vs. "PCA" #12

Open bw2 opened 4 years ago

bw2 commented 4 years ago

I understand using different values of q can significantly affect results, so I wanted to ask whether I'm choosing q and the implementation arg correctly.

My dataset is 54 affected male fibroblast samples.

1) One thing I'm curious about is why the optimal q's found by FRASER optimHyperParams (psi5_q = 5, psi3_q = 2, psiSite_q = 2) are much smaller than the q found by OUTRIDER findEncodingDim (q=12) for the same set of samples, and also smaller than the q used in the colab notebook example with 50 samples (q=10)?

I'm using FRASER to calculate counts for splice junctions, and STARv2 to get gene counts for OUTRIDER.

When I run

for(i in c("psi5", "psi3", "psiSite")) {
    fds <- optimHyperParams(fds, i, plot=FALSE, implementation="PCA", BPPARAM=bpparam)
    plotEncDimSearch(fds, type=i, plotType="auc")
    plotEncDimSearch(fds, type=i, plotType="loss")
}}

I get plots that look like these (the plots for psi5, psi3, and psiSite look nearly identical): fibroblasts_M_without_GTEX__54_samples_0667E4B6A8_plotEncDimSearch_psi3_AUC fibroblasts_M_without_GTEX__54_samples_0667E4B6A8_plotEncDimSearch_psi3_loss

and this is the plot from OUTRIDER plotEncDimSearch(ods) for the same set of samples:

fibroblasts_M_without_GTEX__plotEncDimSearch

2) From the pre-print, I understood that auto-encoder correction performs much better than PCA, so I run FRASER with implementation="AE". Is it fine to run optimHyperParams(..) with the default implementation="PCA", and use those q's for FRASER with implementation="AE"?

3) I tried running optimHyperParams(..) with implementation="AE" (and other values taken from the source code) but got validation errors saying these are not recognized implementations:

> fds = optimHyperParams(fds, 'psi5', implementation="AE", plot=FALSE)
Error in needsHyperOpt(implementation) : Method not found: 'AE'!

> fds = optimHyperParams(fds, 'psi5', implementation="FRASER", plot=FALSE)
dPsi filter:FALSE: 1307001  TRUE: 57861
Exclusion matrix: FALSE: 1335180    TRUE: 29682
Thu Jul  9 02:02:00 2020: Injecting outliers: 9001 / 800 (primary/secondary)
Thu Jul  9 02:02:05 2020: Run hyper optimization with 13 options.
1 ;  2 ;     0
Thu Jul  9 02:02:06 2020 ; q: 2 ; noise:  0
Error: BiocParallel errors
  element index: 1, 2, 3, 4, 5, 6, ...
  first error: 'arg' should be one of “PCA”, “PCA-BB-Decoder”, “AE”, “AE-weighted”, “PCA-BB-full”, “fullAE”, “PCA-regression”, “PCA-reg-full”, “PCA-BB-Decoder-no-weights”, “BB”

4) Given that optimal q computed by optimHyperParams for psi5, psi3, psiSite are sometimes not the same, what is the recommended way to choose which q to pass to FRASER? (currently I'm using q=max(bestQ(fds, type="psi5"), bestQ(fds, type="psi3"), bestQ(fds, type="psiSite")))

After glancing at the source code, I also tried

q = list(psi5=bestQ(fds, type="psi5"), psi3=bestQ(fds, type="psi3"), psiSite=bestQ(fds, type="psiSite"))
fds <- FRASER(fds, q=q, implementation ="AE",  BPPARAM=MulticoreParam(4))

but I got this error:

Wed Jul  8 18:01:50 2020: Fit step for: 'psi5'.
Wed Jul  8 18:01:50 2020: Running fit with correction method: AE

FALSE  TRUE 
97078 30265 
Error in 1:nPcs : NA/NaN argument

I realize now I can separately run fit and the other methods inside FRASER, and pass in a different q to each. Is this recommended?

ischeller commented 4 years ago

Dear @bw2 , thank you for your feedback.

  1. Regarding your first question, it is expected to find a smaller latent space dimension in FRASER compared to OUTRIDER, as the splice metrics are count ratios that contain some implicit normalization already.

  2. We probably need to clarify this point in the pre-print. Also for the pre-print, we ran FRASER with implementation="PCA", which we found give comparably good results to the real AE implementation, but is much faster and therefore we currently use PCA as the default in FRASER. We then compared our beta-binomial p-value based outlier detection approach with a z-score based outlier detection approach and show that the p-value based approach performs much better. So for you, using implementation="PCA" and then calling outliers based on the p-values should be completely fine (in general we also recommend to additionally use a cutoff on delta psi to focus on significant outliers with a large effect size). However, if you want to use our AE implementation, I would suggest you to use implementation="PCA-BB-Decoder", which is a hybrid approach between PCA and the full AE. It uses a PCA to estimate the encoder matrix (to speed up this step) and then uses the BB loss function to only optimize the weights in the decoder matrix. I hope this explanation helps to make this point more clear.

  3. Thank you for spotting and reporting this. This is indeed a bug that shoulded be fixed soon. In the meantime, see my answer to point 2.

  4. Try passing a named vector as the q argument to the FRASER function, like this: q=c(psi5=bestQ(fds, type="psi5"), psi3=bestQ(fds, type="psi3"), psiSite=bestQ(fds, type="psiSite"))

bw2 commented 4 years ago

Hi Ines, thanks very much for the detailed replies. I've now tried running with implementation="PCA" and compared it to implementation="AE" which I ran previously. It's significantly faster as expected, but I'm seeing close to 2x more results at FDS 0.05 when using PCA. I will keep in mind to also try PCA-BB-Decoder.

drewjbeh commented 7 months ago

Hi @bw2. Just following up on this...were you able to run the optimization with different implementations without errors? Are there further questions?

bw2 commented 7 months ago

Hi @drewjbeh , we have now switched to FRASERv2 (and I've personally moved on to other projects).