My name is Matthew Marino. I am a CFDE GlyGEN summer intern working under Jeet Vora and Rene Ranzinger. I am currently trying to create multiomics workflows starting with transcriptomics and proteomics data.
I am hoping to quantify differentially expressed (DE) genes from a counts matrix and its accompanying metadata, and to get a list of DE genes with their corresponding log2FC and q values, ranked by significance. From there, I plan to apply a similar method to LC-MS/MS proteomics and glycoproteomics datasets to assess the overlap/correspondence and to draw meaningful biological conclusions from the datasets.
To analyze differentially expressed genes between untreated and treated samples in an RNA-Seq counts matrix, DESeq2 (DOI: 10.1186/s13059-014-0550-8) does the following, briefly:
1.) Data input: with a counts matrix and metadata describing the sample identifiers in each column (this appears to be similar to the add AnnData option)
2.) Combining both into a single data object (again similar to the AnnData option)
3.) Filtering out genes with low read counts based on row sums.
4.) Size factor calculation: for each sample, the median across genes of the ratio of each gene's count to the geometric mean of that gene's counts across all samples.
5.) Dispersion estimation and shrinkage: models variance by calculating gene-wise dispersion estimates and fitting a trend line to them to capture the mean-dispersion relationship. The gene-wise estimates are then shrunk toward the trend line to improve their accuracy.
6.) Fitting a generalized linear model (GLM) with a negative binomial distribution for each gene, using the experimental conditions in the metadata.
7.) Hypothesis testing: a Wald test asks whether each log2FC differs from zero and provides p values. Multiple testing correction uses the Benjamini-Hochberg method to provide q (p adj.) values, which control the false discovery rate among the genes called differentially expressed.
8.) Results table
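To make step 4 concrete, here is a minimal sketch of the median-of-ratios size factor calculation in plain Python. It is only my illustration of the idea, not DESeq2's actual code: the function name `size_factors` and the toy counts are made up for this example, and I am assuming that genes with a zero count in any sample are left out of the calculation.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts (one per sample).
    Returns one size factor per sample.
    """
    n_samples = len(counts[0])

    # Log geometric mean of each gene across samples.
    # Assumption: genes with any zero count are skipped (None).
    log_geo_means = []
    for row in counts:
        if all(c > 0 for c in row):
            log_geo_means.append(sum(math.log(c) for c in row) / len(row))
        else:
            log_geo_means.append(None)

    factors = []
    for j in range(n_samples):
        # Log ratio of each usable gene's count to its geometric mean.
        log_ratios = [
            math.log(row[j]) - lgm
            for row, lgm in zip(counts, log_geo_means)
            if lgm is not None
        ]
        # The size factor is the median ratio for this sample.
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy data: sample 2 was sequenced twice as deeply as sample 1.
toy_counts = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 3],  # contains a zero, so it is ignored
]
factors = size_factors(toy_counts)
normalized = [[c / f for c, f in zip(row, factors)] for row in toy_counts]
```

Dividing each sample's counts by its size factor puts the samples on a common scale, so in the toy data the first three genes end up with equal normalized counts in both samples.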
A quick note regarding the output: I am hoping that it can return a table that ranks the differentially expressed genes by q value and still contains the gene name and log2FC of each. This will allow the user to see whether each gene is differentially expressed, in which direction, and at what level of statistical significance.
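As a rough sketch of the output I have in mind, the snippet below applies the Benjamini-Hochberg adjustment to a list of p values and returns the rows sorted by q value. The function names, the dictionary keys, and the example numbers are all hypothetical. It also does not reproduce DESeq2's independent filtering, so the adjusted values would not necessarily match DESeq2's `padj` column.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value to the smallest so that the
    # adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def ranked_table(genes, log2fc, pvalues):
    """Build result rows and sort them by q value (ties broken by p value)."""
    padj = benjamini_hochberg(pvalues)
    rows = [
        {"gene": g, "log2FC": fc, "pvalue": p, "padj": q}
        for g, fc, p, q in zip(genes, log2fc, pvalues, padj)
    ]
    return sorted(rows, key=lambda r: (r["padj"], r["pvalue"]))


# Made-up example values, for illustration only.
genes = ["A", "B", "C", "D"]
log2fc = [1.5, -0.4, 0.8, -2.1]
pvalues = [0.01, 0.04, 0.03, 0.005]

table = ranked_table(genes, log2fc, pvalues)
for row in table:
    print(row["gene"], row["log2FC"], round(row["padj"], 4))
```

The sign of log2FC gives the direction of the change (positive meaning higher in the treated group, if treated is the numerator), and the position in the table gives the significance ranking.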
This package is widely used by the community and is considered one of the most accurate ways to model variance and detect differential expression between treated and untreated samples in RNA-Seq data.
Citations:
DOI: 10.1093/bib/bbt086
DOI: 10.1186/1471-2105-14-91