Create PCA dimension reduction example with RNA-seq data

cansavvy commented 4 years ago

Related to #110

Currently we have a microarray example of PCA. We should be able to largely copy the microarray example but switch out the dataset and the accompanying wording.

Ideally we will use the same dataset as we will use for UMAP RNA-seq example (#128) and ideally the dataset will not have too weird looking of a structure.

This can be done separately from the "getting started" additions to each analysis #116

cansavvy commented 4 years ago

I think the dimension reduction module would be a good opportunity to show DESeq2 strategy and non DESeq2 strategy side by side (See #103). Because PCAplot is a handy wrapper function, I think we could do this example as this set of steps: Download Non-Qn'ed data from refinebio -> create a DESEq2 object -> use DESeq2::PCAplot()

~~then in parallel we could do the UMAP example (#128) as a QN'ed example.~~

cansavvy commented 4 years ago

I've changed my mind. I think keeping both as the same DESeq2 normalized dataset where we show here how to use a DESeq2 function and how to use a non-DESeq2 function would be a nicely complementary set of examples. So here, we would use the DESeq2::PCAplot with a DESeq2 object and in the UMAP example (#128) we show how to extract the assay() data and use a non-DESeq2 function.

cansavvy commented 4 years ago

We should also include a comment about the base R PCA functions as an alternative (prcomp and/or princomp).

cbethell commented 4 years ago

As discussed with @cansavvy, the most logical plan for tackling this issue at this time seems to be as follows:

Set up - read in DESeq2 package
Import and set up data - find a large non-Qn’ed dataset that can be used for the PCA and UMAP RNA-seq example notebooks and read that in with its metadata (will post tentative datasets once found, keeping in mind that we want a dataset that has adequate accompanying metadata and that it may be more efficient to borrow a dataset that we have used in one of our previous analyses if it becomes time consuming to search for a new suitable dataset)
Create a DESeqDataset object
Define minimum counts cutoff
Normalize data - Note vst() is faster than rlog() (so we may switch to using vst() for this module is necessary and provide a link and brief explanation for why we would have made this decisionl)
Use plotPCA() function from DESeq2 to plot the PCA values

Note that we will plan on not saving the PC scores because both @cansavvy and I cannot think of a responsible downstream way in which users would utilize the PC scores other than to reimport them and plot again.

I plan to get started on this once I get the clustering RNA-seq example notebook PR cleaned up.

I will use the template in PR #146 as a basis for formatting this notebook.

cbethell commented 4 years ago

In searching for datasets for the notebook described in this ticket, I found the following datasets (which aren't personally satisfying but one of which may suit the purpose of this notebook):

SRP133573 Identification of Transcription Factor Relationships Associated with Androgen Deprivation Therapy Response and Metastatic Progression in Prostate Cancer

Taken from the associated paper, "Computational analyses identified transcription factor coordinated groups (TFCGs) enriched in the high impact group network. Leveraging a large public dataset of over 800 metastatic and primary samples, we identified 33 TFCGs in common between the high impact group and metastatic lesions, including SOX4/FOXA2/GATA4, and a TFCG containing JUN, JUNB, JUND, FOS, FOSB, and FOSL1".

With that information, I tried a different method of displaying the variables (and also tried using different variables) but the best result of my tests along with the code used is as follows:

Note that the idea for this dataset would be to add one annotation layer/variable at a time. So we would start with the most basic PCA plot without annotation, then add each variable one by one.

SRP163027 Differential analysis of gene expression profiles in peripheral blood cells of patients with cervical lesions

I am not a big fan of the plot but the metadata seems less complex than the first dataset. However, this experiment seems to be focused on the gene level as the abstract of the associated paper states that the goal is "to investigate whether cervical cancer (CC) and cervical intraepithelial neoplasia (CIN) can be screened by analyzing gene expression profiling of peripheral blood. RNA-sequencing analysis of blood was performed on 11 CC patients, 21 CIN patients and 19 healthy controls (H)."

As for the results of the experiment "there were significant differences in the expression levels of six genes between CC and H, five genes between CIN and H and four genes between CC and CIN (p < 0.05). Four genes discriminated cervical lesions from H with a sensitivity of 82.61%, a specificity of 87.83% and an area under the curve of 0.8981. Three genes discriminated CC from CIN with a sensitivity of 53.13%, a specificity of 96.39% and an area under the curve of 0.7786".

Note that the notebook with the code used to produce the above plots can be found in draft PR #176.

Feel free to leave any thoughts here or on that draft PR @cansavvy and @jaclyn-taroni! Note that I can surely continue the search for a suitable new dataset (or one that we previously used) if need be but wanted to post these datasets for time efficiency purposes in case one is good enough for our purpose.

cansavvy commented 4 years ago

As we discussed over video chat. I think the second data doesn't have enough metadata labels for our users to do much with.

The first one looks good to me. We can have added metadata labels one at a time. e.g. 0 labels, 1, label, 2 labels.

cbethell commented 4 years ago

This ticket can be closed now that it has been addressed with merged PR #176.

AlexsLemonade / refinebio-examples

Create PCA dimension reduction example with RNA-seq data #129