gabrielodom / pathwayPCA

integrative pathway analysis with modern PCA methodology and gene selection
https://gabrielodom.github.io/pathwayPCA/

Add regression example with PCs and covariates to the vignette #28

Closed: gabrielodom closed this issue 5 years ago

gabrielodom commented 5 years ago

Use data from Chen's original paper or placenta data. Get clarifying information from Steven.

gabrielodom commented 5 years ago

Three examples:

  1. Controlling covariates
  2. Pathway-based prediction
  3. Multi-omics signatures

lxw391 commented 5 years ago

Notes from 25 September Meeting

Four examples:

  1. Ovarian PNNL pathway testing
  2. Testing an interaction effect: kidney cancer has a sex effect (M vs. F); see Han Liang's paper (Comprehensive Characterization of Molecular Differences in Cancer between Male and Female Patients)
    • dataset: KIRP TCGA RNAseq dataset (normalized log2 transformed RSEM -- ask Antonio)
    • pathway collection: Wikipathways and C2:CP (canonical pathways)
    • extract aesPCs first, then merge with sex, survival outcome, and censoring info (ask Antonio)
    • fit the model survival ~ PC1 + SEX + PC1 * SEX for each pathway
    • return a data frame with pathways as rows and model fit statistics as columns (coefficients, p-values, model F-statistic, etc.)

Continued below

gabrielodom commented 5 years ago

@jamesban2015, please help me find (and clean if necessary) the KIRP TCGA RNAseq dataset and get the matching survival outcome and censoring info. Thanks

gabrielodom commented 5 years ago

Example 2 data: Use the KIRP data from: https://xenabrowser.net/datapages/?dataset=TCGA.KIRP.sampleMap%2FHiSeqV2&host=https%3A%2F%2Ftcga.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443

Phenotype: https://xenabrowser.net/datapages/?dataset=TCGA.KIRP.sampleMap%2FKIRP_clinicalMatrix&host=https%3A%2F%2Ftcga.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443

gabrielodom commented 5 years ago

Continued Examples:

  3. Multi-omics signatures for ovarian PNNL. Bing's group did individual-feature testing against survival. DOI: 10.1093/nar/gkx1090. Do pathway analysis (overall survival) using the following data:

Copy-number: https://xenabrowser.net/datapages/?dataset=TCGA.OV.sampleMap%2FGistic2_CopyNumber_Gistic2_all_data_by_genes&host=https%3A%2F%2Ftcga.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443

Phenotype: https://xenabrowser.net/datapages/?dataset=TCGA.OV.sampleMap%2FOV_clinicalMatrix&host=https%3A%2F%2Ftcga.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443

Find the overlap between the significant pathways returned by the copy-number pathway PCA and the significant pathways from the ovarian PNNL pathway PCA (overall survival). Repeat this for C2CP, CP:KEGG (C5GO), and Wikipathways.

gabrielodom commented 5 years ago

  4. Prediction. Use colorectal cancer gene expression: https://xenabrowser.net/datapages/?dataset=TCGA.COADREAD.sampleMap%2FHiSeqV2&host=https%3A%2F%2Ftcga.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443

Phenotype: https://xenabrowser.net/datapages/?dataset=TCGA.COADREAD.sampleMap%2FCOADREAD_clinicalMatrix&host=https%3A%2F%2Ftcga.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443

Perform the following steps:

    i. Split the data 50-50 into training and testing sets.
    ii. Perform AES- or Supervised PCA; extract 1 PC from each pathway from the training set.
    iii. Multiply the pathway-specific testing design matrices by the loadings from the training data to yield testing PCs.
    iv. Use the PCs extracted from the training data to train and cross-validate an elastic-net model (glmnet: use defaults for CV). Store this model.
    v. Predict the testing survival using the PCs from the testing data.
    vi. Compare the predicted test survival to the true survival with a survival ROC curve.
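
As a rough illustration of steps i and iii only, here is a minimal sketch in plain Python (not the R code used for this analysis). The sample IDs, the three-gene pathway, and the loading values are all made up; in the real workflow the loadings would come from the AES- or Supervised PCA fit on the training samples.

```python
import random

def split_half(sample_ids, seed=0):
    """Shuffle the sample IDs and split them 50-50 into (train, test)."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]

def project(rows, loadings):
    """Score each sample (one row of gene values) on one PC: row . loadings."""
    return [sum(x * w for x, w in zip(row, loadings)) for row in rows]

# Hypothetical sample IDs for the 50-50 split (step i).
train_ids, test_ids = split_half(["s1", "s2", "s3", "s4", "s5", "s6"])

# Hypothetical first-PC loadings for a 3-gene pathway, assumed to have been
# estimated from the training samples only.
train_loadings = [0.6, 0.0, 0.8]

# Hypothetical test-set expression values for the same 3 genes (step iii).
test_matrix = [
    [1.0, 5.0, 2.0],
    [0.0, -1.0, 1.0],
]
test_pcs = project(test_matrix, train_loadings)
```

The gene with a zero loading contributes nothing to the test PC, which is the point of using sparse loadings from the training fit.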

gabrielodom commented 5 years ago

Depends on Issue #35.

gabrielodom commented 5 years ago

For example 4, try the following:

  1. Center and scale the test and train data independently (use Issue #37)
  2. Add the full C5 pathway collection (http://software.broadinstitute.org/gsea/msigdb/collections.jsp#C5)
  3. Change the alpha value from 1 to 0.1, 0.2, ..., 0.9, 1 in cv.glmnet()
  4. Try SuperPCA_pVals().

gabrielodom commented 5 years ago

Issue #37 is now closed. Moving forward with this revamped analysis.

gabrielodom commented 5 years ago

Completing the C5 analysis requires Issue #43 to be closed.

gabrielodom commented 5 years ago

I've tested the prediction results for the independently-scaled training and test data: no performance increase. I've also tested the C5 pathway collection: no performance increase.

gabrielodom commented 5 years ago

I've tried two sequences of alpha (0.1, 0.2, ..., 1; 0.01, 0.04, 0.09, 0.16, ..., 1). Smaller values of alpha yielded the "best" performance, but it was still abysmal.
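
For reference, the two grids can be written down as follows. This is a plain-Python sketch, and it assumes the second sequence is the element-wise square of the first, which is what the listed terms (0.01, 0.04, 0.09, 0.16) suggest. The variable names are illustrative only.

```python
# Linear grid: 0.1, 0.2, ..., 1.0
linear_alphas = [k / 10 for k in range(1, 11)]

# Squared grid: 0.01, 0.04, 0.09, 0.16, ..., 1.0
# (assumed to be the squares of the linear grid; denser near 0)
squared_alphas = [k * k / 100 for k in range(1, 11)]
```

Each value would then be passed as the `alpha` argument in a separate cross-validation run.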

jamesban2015 commented 5 years ago

Gabriel, can you provide some details of the performance? Is there a figure or markdown for the performance evaluation?

gabrielodom commented 5 years ago

Completing the SuperPCA analysis requires Issue #44 to be closed.

gabrielodom commented 5 years ago

@jamesban2015 See Rmarkdown and .html reports in the Example Data/Xena Prediction Colorectal directory.

gabrielodom commented 5 years ago

Completing the SuperPCA analysis requires Issue #45 to be closed.

gabrielodom commented 5 years ago

Results for Supervised PCA are in Example Data/Xena Prediction Colorectal/SuperPCA_Prediction3.html. It's not good.

jamesban2015 commented 5 years ago

Did you try predicting the training data instead of the testing data?

gabrielodom commented 5 years ago

One issue was that I did not center and scale the test data before loading it onto the PCs calculated from the training data. This is related to Issue #37, which I've re-opened. Basically, even though I chose not to center and scale the training data, the internal PCA routine scaled the data anyway. After I fix that issue, I want to try again with the raw training and test data for both AESPCA and SuperPCA.
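
To make the two scaling options concrete, here is a small plain-Python sketch (not the package's internal routine). The matrices are toy data with samples as rows and genes as columns, and the function names are hypothetical. Option A scales the test set with its own statistics; Option B reuses the training statistics. Which of these matches the internal PCA routine is not settled by this sketch.

```python
from statistics import mean, stdev

def column_stats(rows):
    """Per-gene (column) mean and sample standard deviation."""
    columns = list(zip(*rows))
    return [mean(c) for c in columns], [stdev(c) for c in columns]

def scale(rows, centers, scales):
    """Center and scale each column with the supplied statistics.

    Assumes every scale is non-zero (no constant columns).
    """
    return [[(x - m) / s for x, m, s in zip(row, centers, scales)]
            for row in rows]

# Toy data: 3 training samples and 2 test samples, 2 genes each.
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
test = [[2.0, 40.0], [4.0, 20.0]]

# Option A: scale the test set independently, using its own statistics.
test_independent = scale(test, *column_stats(test))

# Option B: reuse the training means and standard deviations on the test set.
train_centers, train_scales = column_stats(train)
test_with_train_stats = scale(test, train_centers, train_scales)
```

Whichever option is used, the test data should go through the same kind of transformation as the training data before being multiplied by the training loadings.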

gabrielodom commented 5 years ago

The Cox prediction does not return survival times directly: http://r.789695.n4.nabble.com/estimating-survival-times-with-glmnet-and-coxph-td4614225.html

Look at approaches 2 and 3 here: http://gaodoris.blogspot.com/2012/10/5-ways-to-estimate-concordance-index.html

gabrielodom commented 5 years ago

That's because it doesn't make sense to measure how well a survival prediction is performing based on individual survival times. Predicting survival time is apparently difficult (if not impossible) in the CoxPH framework.

  1. Survival prediction doesn't work at the patient level: http://dx.doi.org/10.1136/jme.2005.012427
  2. Cox PH models cannot predict survival time: https://stats.stackexchange.com/questions/79362/how-to-get-predictions-in-terms-of-survival-time-from-a-cox-ph-model
  3. Try to figure out Harrell's c instead: https://stats.stackexchange.com/questions/116540/how-to-evaluate-the-goodness-of-fit-for-survial-functions
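
For orientation, a minimal plain-Python sketch of Harrell's c for right-censored data follows. It is a simplified version (pairs with tied follow-up times are skipped) and is not the implementation from any particular R package. The function name and the toy data are made up. The risk score could be, for example, the linear predictor from a Cox model.

```python
def harrell_c(times, events, risk_scores):
    """Simplified Harrell's concordance index for right-censored data.

    times: observed follow-up times
    events: 1 if the event was observed, 0 if censored
    risk_scores: higher score = higher predicted risk
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable when subject i has an observed event
            # strictly before subject j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier event had the higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied scores count as half
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable

# Toy data: 4 subjects, the third one is censored.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_scores = [0.9, 0.7, 0.8, 0.1]
c_index = harrell_c(toy_times, toy_events, toy_scores)  # 4 of 5 usable pairs concordant
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so this measures how well the model orders patients by risk rather than how well it predicts individual survival times.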

gabrielodom commented 5 years ago

For example 3, compare C2 under SuperPCA and AESPCA. Look at the genes internal to the shared significant pathways for these two techniques. Can we tell a story?

gabrielodom commented 5 years ago

For example 3, the shared genes are shown in Xena Multi-Omics Ovarian/Reports/summary_ovarian_multiomics.html.

lxw391 commented 5 years ago

I think you're pulling out genes that exist in both the copy number data and the proteomics data. Could you pull out genes with non-zero coefficients in AES-PCA in both copy number and proteomics data? These would be the genes that contribute to pathway significance and the ones we are interested in.
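
In other words, something like the following plain-Python sketch, for one pathway at a time. The gene names and loading values are invented for illustration; the real loadings would come from the AES-PCA fits on each data type.

```python
def nonzero_genes(loadings):
    """Genes whose sparse-PC loading is not zero."""
    return {gene for gene, weight in loadings.items() if weight != 0.0}

# Hypothetical AES-PCA loadings for one pathway in each data type.
copy_number_loadings = {"GENE_A": 0.7, "GENE_B": 0.0, "GENE_C": -0.4}
proteomics_loadings = {"GENE_A": 0.5, "GENE_B": 0.3, "GENE_C": 0.0, "GENE_D": 0.2}

# Genes measured in both data types (the current report's overlap).
measured_in_both = set(copy_number_loadings) & set(proteomics_loadings)

# Genes with non-zero loadings in both data types (the requested overlap).
shared_drivers = (nonzero_genes(copy_number_loadings)
                  & nonzero_genes(proteomics_loadings))
```

Here three genes are measured in both data types, but only `GENE_A` has a non-zero loading in both, so it is the only one that would be reported.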

gabrielodom commented 5 years ago

For example 2, the significant pathways are shown in Xena Interaction Kidney/Reports/KIRP_Sex_PC_Interaction.html

gabrielodom commented 5 years ago

For @lxw391's comment on example 3: I've updated the multi-omics report to include the overlap of the genes from significant pathways which also had non-zero loadings. This is in Xena Multi-Omics Ovarian/Reports/summary_ovarian_multiomics.html

gabrielodom commented 5 years ago

For the vignettes, include a section showing the user querying an online data repository for data. We don't want to include the KIRP, copy number, or ovarian PNNL data in the package itself unless we have to.

gabrielodom commented 5 years ago

Move conversation to Issue #49.