Closed jaclyn-taroni closed 5 years ago
For the compendium of all the data for an organism, we'll need to give it some TLC. This likely means that we'll need to include the NAs (so we should have this be configurable for the smasher at least via the API, even if we use the inner join as the default for the web UI). Then we will likely want to prune things that are too missing to impute (>30% missing is the usual threshold). After pruning, we'll likely want to use KNN to impute the rest.
A quick google of "knn impute python" gets me: https://github.com/iskandr/knnimpute
There are a lot of ways to do knn impute poorly, but that one looks good at a first glance.
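A minimal sketch of the pruning step mentioned above, assuming the expression matrix is a plain dict of gene to list of values with `None` marking a missing entry. The names `missing_fraction` and `prune_rows` are hypothetical, and the 30% cutoff is the "usual threshold" from the comment, not a tested choice.

```python
import math


def missing_fraction(values):
    """Fraction of entries in a row that are missing (None or NaN)."""
    missing = sum(
        1 for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return missing / len(values)


def prune_rows(matrix, max_missing=0.30):
    """Keep only genes whose missing fraction does not exceed max_missing."""
    return {
        gene: row
        for gene, row in matrix.items()
        if missing_fraction(row) <= max_missing
    }


expression = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],    # nothing missing: kept
    "gene_b": [1.0, None, 3.0, 4.0],   # 25% missing: kept
    "gene_c": [None, None, 3.0, 4.0],  # 50% missing: dropped
}
pruned = prune_rows(expression)
```

Whatever survives this filter is what would be handed to the imputation step.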
To summarize, for species-level compendia:
I'm going to change the title to better reflect this.
From Troyanskaya, et al. 2001:
KNNimpute algorithm
The KNN-based method selects genes with expression profiles similar to the gene of interest to impute missing values. If we consider gene A that has one missing value in experiment 1, this method would find K other genes, which have a value present in experiment 1, with expression most similar to A in experiments 2–N (where N is the total number of experiments). A weighted average of values in experiment 1 from the K closest genes is then used as an estimate for the missing value in gene A. In the weighted average, the contribution of each gene is weighted by similarity of its expression to that of gene A.
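To make the quoted description concrete, here is a small pure-Python sketch of the idea. It is not the knnimpute package linked above and not the paper's reference implementation. Assumptions on my part: rows are genes, `None` marks a missing value, similarity is a root-mean-square Euclidean distance over the experiments both genes have, and the weights are inverse distances (the `eps` term only avoids division by zero).

```python
import math


def is_missing(v):
    return v is None or (isinstance(v, float) and math.isnan(v))


def distance(a, b, skip):
    """RMS difference over columns observed in both rows, ignoring column `skip`."""
    diffs = [
        (x - y) ** 2
        for i, (x, y) in enumerate(zip(a, b))
        if i != skip and not is_missing(x) and not is_missing(y)
    ]
    return math.sqrt(sum(diffs) / len(diffs)) if diffs else math.inf


def knn_impute(matrix, k=2, eps=1e-6):
    """Fill each missing value with a weighted average from the k closest genes."""
    imputed = [list(row) for row in matrix]  # leave the input untouched
    for g, row in enumerate(matrix):
        for j, value in enumerate(row):
            if not is_missing(value):
                continue
            # Candidate neighbours must have a value present in experiment j.
            candidates = [
                (distance(row, other, j), other[j])
                for h, other in enumerate(matrix)
                if h != g and not is_missing(other[j])
            ]
            nearest = sorted(c for c in candidates if c[0] != math.inf)[:k]
            if not nearest:
                continue  # nothing comparable; the value stays missing
            weights = [1.0 / (d + eps) for d, _ in nearest]
            total = sum(w * v for w, (_, v) in zip(weights, nearest))
            imputed[g][j] = total / sum(weights)
    return imputed


genes = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, None],  # gene with one missing value
    [1.1, 2.1, 3.2],
    [9.0, 9.0, 9.0],   # dissimilar gene, outside the k=2 neighbourhood
]
result = knn_impute(genes, k=2)
```

The missing entry ends up very close to 3.0 because the nearest gene matches exactly on the shared experiments and therefore dominates the weighted average.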
Writing some more detailed notes about how this might look:
[...] tximport, lengthScaledTPM). On the other hand, if we use Euclidean distance (more sensitive to outliers) as specified above, the log2(x + 1) transformation could be helpful. My googling of RNA-seq data imputation yielded top results related to scRNA-seq and not bulk as far as I can tell, unfortunately.
I think some small scale experiments are in order as some of this is based on intuition and/or experience with a single platform.
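As a quick illustration of the Euclidean distance point, the sketch below compares two made-up expression profiles that differ mostly in one highly expressed gene, before and after a log2(x + 1) transformation. The values are invented for illustration only and are not real lengthScaledTPM data.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def log2p1(values):
    """Apply the log2(x + 1) transformation element-wise."""
    return [math.log2(v + 1) for v in values]


# Two hypothetical profiles; only the last gene differs by a large amount.
profile_a = [5.0, 10.0, 20.0, 1000.0]
profile_b = [5.0, 12.0, 18.0, 4000.0]

raw_distance = euclidean(profile_a, profile_b)
log_distance = euclidean(log2p1(profile_a), log2p1(profile_b))
```

On the raw scale the distance is almost entirely the single large difference, while on the log scale the smaller differences contribute a visible share. That is the intuition only; it does not replace the small scale experiments.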
What I'll need to do small scale experiments:
lengthScaledTPM.tsv for all available zebrafish samples (there's 400ish according to the frontend and that's probably around where we want to be)
I can come up with a list of sample accession codes for the last point
GEO microarray samples from zebrafish, zebgene10st, and zebgene11st: zf_samples_three_affy_platforms.txt
Notes: 1) For full outer joins (union of all genes), missing values should be filled with NA. 2) We should keep track of where the imputed values are in the final matrix, as this might be something users want to know.
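Both notes can be sketched in a few lines. This is a standard-library toy, assuming each platform is a `(sample_names, {gene: values})` pair and using `None` for NA; the helper names `full_outer_join` and `missing_mask` are hypothetical. In practice `pandas.merge` with `how="outer"` gives the same union-of-genes behaviour.

```python
def full_outer_join(platforms):
    """Union of all genes; samples a gene was not measured in get None."""
    all_samples = [s for samples, _ in platforms for s in samples]
    all_genes = sorted({g for _, rows in platforms for g in rows})
    joined = {}
    for gene in all_genes:
        row = []
        for samples, rows in platforms:
            # A gene absent from this platform is NA in all of its samples.
            row.extend(rows.get(gene, [None] * len(samples)))
        joined[gene] = row
    return all_samples, joined


def missing_mask(joined):
    """True wherever a value is missing, i.e. where imputation will write."""
    return {gene: [v is None for v in row] for gene, row in joined.items()}


platform_1 = (["s1", "s2"], {"gene_a": [1.0, 2.0], "gene_b": [3.0, 4.0]})
platform_2 = (["s3"], {"gene_b": [5.0], "gene_c": [6.0]})

samples, joined = full_outer_join([platform_1, platform_2])
mask = missing_mask(joined)
```

Keeping `mask` alongside the final matrix is one way to let users see which values were imputed.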
Within a species:
[...] microarray_expression_matrix (this may end up being a DataFrame)
[...] lengthScaledTPM) with a full outer join to form a rnaseq_expression_matrix
[...] lengthScaledTPM values for each row (gene) of the rnaseq_expression_matrix (rnaseq_row_sums)
[...] rnaseq_row_sums
[...] rnaseq_expression_matrix with a row sum < 10th percentile of rnaseq_row_sums; this is now filtered_rnaseq_matrix
log2(x + 1) transform filtered_rnaseq_matrix; this is now log2_rnaseq_matrix
[...] log2_rnaseq_matrix to NA, but make sure to keep track of where these zeroes are
[...] microarray_expression_matrix and log2_rnaseq_matrix; combined_matrix
[...] combined_matrix
[...] combined_matrix
[...] NA in RNA-seq samples (i.e., make these zero again) in combined_matrix
[...] combined_matrix; transposed_matrix
[...] IterativeSVD (rank=10) on the transposed_matrix; imputed_matrix
[...] imputed_matrix (genes are now rows, samples are now columns)
[...] imputed_matrix where genes are rows and samples are columns
I would like this checklist to be included in the methods section of the PR where it is implemented.
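A rough sketch of the RNA-seq-only portion of the checklist (row sums, 10th percentile filter, log2(x + 1), zeroes to NA). The verbs for several steps are my reading of the checklist, so treat them as assumptions: that the row sums are plain sums, that rows below the 10th percentile are dropped, and that zeroes are set to NA with their positions recorded. The percentile here uses `statistics.quantiles` with `method="inclusive"`, which is one of several percentile definitions. The function name `preprocess_rnaseq` is hypothetical, and the toy matrix has no missing values. The join, transpose, and IterativeSVD steps are not shown.

```python
import math
import statistics


def preprocess_rnaseq(rnaseq_expression_matrix):
    """Filter low-expression genes, log-transform, and mask zeroes as NA."""
    # Sum of values for each row (gene).
    rnaseq_row_sums = {g: sum(row) for g, row in rnaseq_expression_matrix.items()}
    # First decile cut point = 10th percentile of the row sums.
    cutoff = statistics.quantiles(
        rnaseq_row_sums.values(), n=10, method="inclusive"
    )[0]
    # Drop rows whose sum is below the 10th percentile.
    filtered_rnaseq_matrix = {
        g: row
        for g, row in rnaseq_expression_matrix.items()
        if rnaseq_row_sums[g] >= cutoff
    }
    log2_rnaseq_matrix = {
        g: [math.log2(v + 1) for v in row]
        for g, row in filtered_rnaseq_matrix.items()
    }
    # Remember where the zeroes were so they can be restored later.
    zero_mask = {g: [v == 0 for v in row] for g, row in log2_rnaseq_matrix.items()}
    with_na = {
        g: [None if v == 0 else v for v in row]
        for g, row in log2_rnaseq_matrix.items()
    }
    return with_na, zero_mask


toy_matrix = {
    "gene_a": [0.0, 0.0, 0.0],    # row sum 0: below the cutoff, dropped
    "gene_b": [1.0, 0.0, 3.0],
    "gene_c": [7.0, 15.0, 31.0],
    "gene_d": [3.0, 1.0, 7.0],
}
log2_with_na, zero_mask = preprocess_rnaseq(toy_matrix)
```

The returned `zero_mask` is what a later step would use to make the RNA-seq NA values zero again in the combined matrix.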
Context
I have been thinking quite a bit about integrating platforms recently as a result of the QN discussion (#488).
Problem or idea
Because we're using an inner join in the smasher, we'll drop genes that are not included in every platform. If we were to include HGU133B and HGU95E in the human compendium in this manner, we would presumably be missing many highly expressed and important genes!
Solution or next step
Tagging @cgreene for discussion and @Miserlou to comment on implementation. There are a few routes I can think of at this time, though I cannot comment on feasibility:
[...] pandas.merge. This would also probably change how we do QN. (I believe preprocessCore::normalize.quantiles.use.target, if we go that route, assumes genes are missing at random. Not a good assumption here.)
New Issue Checklist