Processing species-level compendia

jaclyn-taroni commented 6 years ago

Context

I have been thinking quite a bit about integrating platforms recently as a result of the QN discussion (#488).

Problem or idea

Because we're using an inner join in the smasher, we'll drop genes that are not included in every platform. If we were to include HGU133B and HGU95E in the human compendium in this manner, we would presumably be missing many highly expressed and important genes!

Solution or next step

Tagging @cgreene for discussion and @Miserlou to comment on implementation. There are a few routes I can think of at this time, though I cannot comment on feasibility:

Change smasher behavior in the compendium case such that we fill in missing genes with NAs. This would mean changing the arguments to pandas.merge. This would also probably change how we do QN. (I believe preprocessCore::normalize.quantiles.use.target, if we go that route, assumes genes are missing at random. Not a good assumption here.)
We curate a list of platforms that we don't include in the compendium. This may be trickier for model organisms where I have less foundational knowledge, but I am sure I could figure it out.
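
To illustrate the first route, here is a minimal sketch of how the `how` argument to `pandas.merge` changes which genes survive. The two tiny tables and their gene/sample names are made up for illustration and are not refine.bio data; the real smasher code may structure its merge differently.

```python
import pandas as pd

# Hypothetical per-platform expression tables indexed by gene ID.
platform_a = pd.DataFrame(
    {"sample_1": [5.1, 7.3, 2.2]}, index=["gene_a", "gene_b", "gene_c"]
)
platform_b = pd.DataFrame({"sample_2": [6.0, 1.9]}, index=["gene_b", "gene_d"])

# Inner join: only genes measured on every platform are kept.
inner = pd.merge(
    platform_a, platform_b, left_index=True, right_index=True, how="inner"
)

# Outer join: the union of genes is kept and gaps are filled with NaN.
outer = pd.merge(
    platform_a, platform_b, left_index=True, right_index=True, how="outer"
)

print(inner)
print(outer)
```

With the inner join only `gene_b` remains; with the outer join all four genes remain and the unmeasured cells are NaN rather than zero.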

New Issue Checklist

[x] The title is short and descriptive.
[x] You have explained the context that led you to write this issue.
[x] You have reported a problem or idea.
[x] You have proposed a solution or next step.

cgreene commented 6 years ago

For the compendium of all the data for an organism, we'll need to give it some TLC. This likely means that we'll need to include the NAs (so we should have this be configurable for the smasher at least via the API, even if we use the inner join as the default for the web UI). Then we will likely want to prune things that are too missing to impute (>30% missing is the usual threshold). After pruning, we'll likely want to use KNN to impute the rest.

jaclyn-taroni commented 6 years ago

A quick google of "knn impute python" gets me: https://github.com/iskandr/knnimpute

cgreene commented 6 years ago

There are a lot of ways to do knn impute poorly, but that one looks good at first glance.

jaclyn-taroni commented 6 years ago

To summarize, for species-level compendia:

We'd want to do a full outer join (union of all genes). Missing values should be NAs (I'm assuming that this is the default behavior, but we don't want it to be zero or some other value).
Drop genes with >30% missing values
Impute values (KNN impute)
Quantile normalize
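
A rough sketch of the pruning and quantile normalization steps, assuming a genes x samples pandas DataFrame. The function names are hypothetical, and this version normalizes to the mean of the sorted columns rather than to a precomputed QN target, so it is only a stand-in for whatever we end up doing with the target from #488. Ties are broken by position, which a real implementation may want to handle differently.

```python
import numpy as np
import pandas as pd


def drop_sparse_genes(matrix, max_missing=0.30):
    """Keep genes (rows) whose fraction of missing values is <= max_missing."""
    missing_fraction = matrix.isna().mean(axis=1)
    return matrix.loc[missing_fraction <= max_missing]


def quantile_normalize(matrix):
    """Quantile normalize the columns (samples) of a complete matrix.

    Assumption: the reference distribution is the mean of the sorted columns,
    not an external QN target.
    """
    values = matrix.to_numpy(dtype=float)
    reference = np.sort(values, axis=0).mean(axis=1)
    ranks = values.argsort(axis=0).argsort(axis=0)  # rank of each value within its column
    return pd.DataFrame(reference[ranks], index=matrix.index, columns=matrix.columns)


# Made-up example: gene_c is missing in 2 of 4 samples (50%) and is dropped.
expression = pd.DataFrame(
    {
        "s1": [1.0, 2.0, np.nan],
        "s2": [1.5, np.nan, np.nan],
        "s3": [0.5, 2.5, 3.0],
        "s4": [1.2, 2.2, 3.3],
    },
    index=["gene_a", "gene_b", "gene_c"],
)
pruned = drop_sparse_genes(expression)

# Quantile normalization needs a complete matrix, i.e. it runs after imputation.
complete = pd.DataFrame(
    {"s1": [1.0, 3.0, 2.0], "s2": [10.0, 30.0, 20.0]}, index=["g1", "g2", "g3"]
)
normalized = quantile_normalize(complete)
```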

jaclyn-taroni commented 6 years ago

I'm going to change the title to better reflect this.

jaclyn-taroni commented 6 years ago

From Troyanskaya, et al. 2001:

KNNimpute algorithm

The KNN-based method selects genes with expression profiles similar to the gene of interest to impute missing values. If we consider gene A that has one missing value in experiment 1, this method would find K other genes, which have a value present in experiment 1, with expression most similar to A in experiments 2–N (where N is the total number of experiments). A weighted average of values in experiment 1 from the K closest genes is then used as an estimate for the missing value in gene A. In the weighted average, the contribution of each gene is weighted by similarity of its expression to that of gene A.
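
A minimal NumPy sketch of the procedure described in that passage, for a genes x experiments array. It is not the knnimpute package linked above. Two details are my assumptions rather than anything stated in the excerpt: similarity is measured as root-mean-square distance over the experiments both genes have observed, and the weights are inverse distances.

```python
import numpy as np


def knn_impute(matrix, k=2):
    """Impute NaNs in a genes x experiments array from the k most similar genes."""
    matrix = np.asarray(matrix, dtype=float)
    imputed = matrix.copy()
    for gene, experiment in zip(*np.where(np.isnan(matrix))):
        observed = ~np.isnan(matrix[gene])  # experiments measured for this gene
        candidates = []
        for other in range(matrix.shape[0]):
            # Neighbors must have a value present in the experiment being imputed.
            if other == gene or np.isnan(matrix[other, experiment]):
                continue
            shared = observed & ~np.isnan(matrix[other])
            if not shared.any():
                continue
            diff = matrix[gene, shared] - matrix[other, shared]
            distance = np.sqrt(np.mean(diff ** 2))
            candidates.append((distance, matrix[other, experiment]))
        if not candidates:
            continue  # nothing to borrow from; leave the value missing
        candidates.sort(key=lambda pair: pair[0])
        distances, values = map(np.array, zip(*candidates[:k]))
        weights = 1.0 / (distances + 1e-8)  # assumed weighting: inverse distance
        imputed[gene, experiment] = np.sum(weights * values) / np.sum(weights)
    return imputed


# Made-up example: gene 1 is missing its value in the last experiment.
data = np.array(
    [
        [1.0, 2.0, 3.0],
        [1.0, 2.0, np.nan],
        [5.0, 6.0, 7.0],
        [1.1, 2.1, 3.1],
    ]
)
result = knn_impute(data, k=2)
```

In this toy case the two nearest genes are rows 0 and 3, so the estimate lands very close to 3.0 and the distant row 2 does not contribute.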

jaclyn-taroni commented 6 years ago

Writing some more detailed notes about how this might look:

[Image: 10_7_18 12_54 pm office lens]

Some more notes / details

RNA-seq zeroes could be biologically meaningful (e.g., transcript is not expressed or is very lowly-expressed in the tissue being sampled), but for this purpose we'll likely set zeroes to NAs.
My current thinking around splitting by technology:
- I expect the distributions of RNA-seq data and microarray data to be pretty different, so imputing together strikes me as a bit strange.
- The rationale for quantile normalizing is similar to the above: I expect no-op'd Affymetrix data that's been processed with RMA to have different values from refine.bio processed Affymetrix data (SCAN). In practice, I'm not sure whether the order of these two steps will make a big difference, and if it does not, it's simpler to keep the order the same between technologies.
I believe we're still log2(x + 1) transforming RNA-seq data when we're aggregating by species. This shouldn't affect the quantile normalization (zeroes remain, rank is same), so we may want to take that out as specified on #488. I'm not sure if we want to do this transformation in this setting. I was imagining we'd perform imputation on the count scale data (output of tximport, lengthScaledTPM). On the other hand, if we use Euclidean distance (more sensitive to outliers) as specified above, the log2(x + 1) transformation could be helpful. My googling of RNA-seq data imputation yielded top results related to scRNA-seq and not bulk as far as I can tell, unfortunately.
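
A quick check of the two claims in the last point, using made-up numbers rather than real lengthScaledTPM values: log2(x + 1) keeps zeroes at zero and preserves within-sample ordering, while it shrinks the Euclidean distance contributed by a single outlying gene.

```python
import numpy as np

# Hypothetical lengthScaledTPM values for one sample (not real data).
tpm = np.array([0.0, 3.0, 0.0, 15.0, 1.0, 7.0])
log_tpm = np.log2(tpm + 1)

# A monotonic transform keeps the within-sample ordering, and log2(0 + 1) is 0.
same_order = np.array_equal(
    np.argsort(tpm, kind="stable"), np.argsort(log_tpm, kind="stable")
)
zeroes_preserved = np.array_equal(tpm == 0, log_tpm == 0)

# Euclidean distance between two hypothetical samples that differ in one outlying gene.
sample_a = np.array([1.0, 10.0, 1000.0])
sample_b = np.array([1.0, 10.0, 10.0])
raw_distance = np.linalg.norm(sample_a - sample_b)
log_distance = np.linalg.norm(np.log2(sample_a + 1) - np.log2(sample_b + 1))

print(same_order, zeroes_preserved, raw_distance, log_distance)
```

Because the ranks are unchanged, rank-based quantile normalization should give the same result with or without the transform, whereas a distance-based imputation would see quite different neighborhoods.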

I think some small scale experiments are in order as some of this is based on intuition and/or experience with a single platform.

jaclyn-taroni commented 6 years ago

What I'll need to do small scale experiments:

The most up-to-date zebrafish QN target
lengthScaledTPM.tsv for all available zebrafish samples (there's 400ish according to the frontend and that's probably around where we want to be)
Around 400 randomly selected microarray samples prior to smashing (e.g., individual "PCL" files)

I can come up with a list of sample accession codes for the last point

jaclyn-taroni commented 6 years ago

GEO microarray samples from zebrafish, zebgene10st, and zebgene11st: zf_samples_three_affy_platforms.txt

jaclyn-taroni commented 6 years ago

Notes: 1) For full outer joins (union of all genes), missing values should be filled with NA. 2) We should keep track of where the imputed values are in the final matrix, as this might be something users want to know.

Within a species:

[ ] Combine all microarray samples with a full outer join to form a microarray_expression_matrix (this may end up being a DataFrame)
[ ] Combine all RNA-seq samples (lengthScaledTPM) with a full outer join to form a rnaseq_expression_matrix
[ ] Calculate the sum of the lengthScaledTPM values for each row (gene) of the rnaseq_expression_matrix (rnaseq_row_sums)
[ ] Calculate the 10th percentile of rnaseq_row_sums
[ ] Drop all rows in rnaseq_expression_matrix with a row sum < 10th percentile of rnaseq_row_sums; this is now filtered_rnaseq_matrix
[ ] log2(x + 1) transform filtered_rnaseq_matrix; this is now log2_rnaseq_matrix
[ ] Set all zero values in log2_rnaseq_matrix to NA, but make sure to keep track of where these zeroes are
[ ] Perform a full outer join of microarray_expression_matrix and log2_rnaseq_matrix; combined_matrix
[ ] Remove genes (rows) with >30% missing values in combined_matrix
[ ] Remove samples (columns) with >50% missing values in combined_matrix
[ ] "Reset" zero values that were set to NA in RNA-seq samples (i.e., make these zero again) in combined_matrix
[ ] Transpose combined_matrix; transposed_matrix
[ ] Perform imputation of missing values with IterativeSVD (rank=10) on the transposed_matrix; imputed_matrix
[ ] Untranspose imputed_matrix (genes are now rows, samples are now columns)
[ ] Quantile normalize imputed_matrix where genes are rows and samples are columns

I would like this checklist to be included in the methods section of the PR where it is implemented.
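
A rough sketch of how the checklist above could look with pandas and NumPy, up to and including imputation. The function names are hypothetical, the toy matrices are made up, and `iterative_svd_impute` is a plain NumPy stand-in (mean fill followed by repeated low-rank reconstruction), not the IterativeSVD implementation named in the checklist. Quantile normalization is left out of this sketch.

```python
import numpy as np
import pandas as pd


def prepare_rnaseq(rnaseq_expression_matrix):
    """Filter low-sum genes, log2(x + 1) transform, and mask zeroes as NA."""
    rnaseq_row_sums = rnaseq_expression_matrix.sum(axis=1)
    cutoff = rnaseq_row_sums.quantile(0.10)
    filtered_rnaseq_matrix = rnaseq_expression_matrix.loc[rnaseq_row_sums >= cutoff]
    log2_rnaseq_matrix = np.log2(filtered_rnaseq_matrix + 1)
    zero_mask = log2_rnaseq_matrix == 0  # remember where the zeroes were
    return log2_rnaseq_matrix.mask(zero_mask), zero_mask


def combine_and_filter(microarray_expression_matrix, log2_rnaseq_matrix, zero_mask,
                       max_gene_missing=0.30, max_sample_missing=0.50):
    """Outer join both technologies, drop sparse genes and samples, reset zeroes."""
    combined = microarray_expression_matrix.join(log2_rnaseq_matrix, how="outer")
    combined = combined.loc[combined.isna().mean(axis=1) <= max_gene_missing]
    combined = combined.loc[:, combined.isna().mean(axis=0) <= max_sample_missing]
    # "Reset" the masked RNA-seq zeroes for genes and samples that survived.
    restore = zero_mask.reindex(
        index=combined.index, columns=combined.columns, fill_value=False
    ).astype(bool)
    return combined.mask(restore, 0.0)


def iterative_svd_impute(matrix, rank=10, n_iter=50):
    """Stand-in imputation: fill NaNs from a repeatedly refit rank-limited SVD."""
    values = np.asarray(matrix, dtype=float)
    missing = np.isnan(values)
    filled = np.where(missing, np.nanmean(values, axis=0), values)  # column-mean start
    rank = min(rank, min(values.shape))
    for _ in range(n_iter):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
        filled[missing] = low_rank[missing]  # only missing entries are updated
    return filled


# Made-up data: g3 is absent from the ma_2 platform; g5 has no RNA-seq signal.
microarray_expression_matrix = pd.DataFrame(
    {"ma_1": [5.0, 6.0, 7.0, 8.0], "ma_2": [5.5, 6.5, np.nan, 8.5]},
    index=["g1", "g2", "g3", "g4"],
)
rnaseq_expression_matrix = pd.DataFrame(
    {"rs_1": [0.0, 3.0, 7.0, 15.0, 0.0], "rs_2": [1.0, 0.0, 7.0, 31.0, 0.0]},
    index=["g1", "g2", "g3", "g4", "g5"],
)

log2_rnaseq_matrix, zero_mask = prepare_rnaseq(rnaseq_expression_matrix)
combined_matrix = combine_and_filter(
    microarray_expression_matrix, log2_rnaseq_matrix, zero_mask
)
imputed_mask = combined_matrix.isna()  # track which values will be imputed

transposed_matrix = combined_matrix.T  # samples are rows, genes are columns
imputed_matrix = pd.DataFrame(
    iterative_svd_impute(transposed_matrix, rank=2),
    index=transposed_matrix.index,
    columns=transposed_matrix.columns,
).T  # back to genes as rows, samples as columns
```

Keeping `imputed_mask` alongside the final matrix is one way to cover the note about telling users which values were imputed.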

AlexsLemonade / refinebio