AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Processing species-level compendia #508

Closed jaclyn-taroni closed 5 years ago

jaclyn-taroni commented 6 years ago

Context

I have been thinking quite a bit about integrating platforms recently as a result of the QN discussion (#488).

Problem or idea

Because we're using an inner join in the smasher, we'll drop genes that are not included in every platform. If we were to include HGU133B and HGU95E in the human compendium in this manner, we would presumably be missing many highly expressed and important genes!
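As a toy illustration of the concern above, the sketch below compares an inner and an outer join on two made-up platform matrices. It assumes pandas; the smasher's actual join code is not shown in this thread, and the gene names, sample names, and values are hypothetical.

```python
import pandas as pd

# Hypothetical gene-by-sample matrices from two platforms (values are made up).
platform_a = pd.DataFrame(
    {"sample_1": [5.1, 7.3, 2.2], "sample_2": [4.9, 7.8, 2.0]},
    index=["GENE_A", "GENE_B", "GENE_C"],
)
platform_b = pd.DataFrame(
    {"sample_3": [6.5, 3.1]},
    index=["GENE_B", "GENE_D"],
)

# Inner join: keep only genes measured on every platform.
inner = pd.concat([platform_a, platform_b], axis=1, join="inner")

# Outer join: keep the union of genes; unmeasured entries become NaN.
outer = pd.concat([platform_a, platform_b], axis=1, join="outer")

print("genes kept by inner join:", inner.index.tolist())
print("genes kept by outer join:", sorted(outer.index.tolist()))
print("missing values after outer join:", int(outer.isna().sum().sum()))
```

Only the gene shared by both platforms survives the inner join, while the outer join keeps every gene at the cost of missing values that then have to be handled.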

Solution or next step

Tagging @cgreene for discussion and @Miserlou to comment on implementation. There are a few routes I can think of at this time, though I cannot comment on feasibility:


cgreene commented 6 years ago

For the compendium of all the data for an organism, we'll need to give it some TLC. This likely means that we'll need to include the NAs (so this should be configurable for the smasher, at least via the API, even if we use the inner join as the default for the web UI). Then we will likely want to prune things with too many missing values to impute (>30% missing is the usual threshold). After pruning, we'll likely want to use KNN to impute the rest.
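The pruning step could look roughly like the following sketch. It assumes NumPy and a genes-by-samples matrix with NaN for missing values. The function name `prune_by_missingness`, the choice to prune rows first and columns second, and the toy data are all assumptions for illustration; the thread does not specify which axis is pruned or in what order.

```python
import numpy as np

def prune_by_missingness(matrix, max_missing=0.30):
    """Drop rows, then columns, whose fraction of NaNs exceeds max_missing.

    Pruning rows before columns is an assumption; the order can change
    which entries survive.
    """
    matrix = np.asarray(matrix, dtype=float)
    keep_rows = np.isnan(matrix).mean(axis=1) <= max_missing
    matrix = matrix[keep_rows]
    keep_cols = np.isnan(matrix).mean(axis=0) <= max_missing
    return matrix[:, keep_cols], keep_rows, keep_cols

nan = np.nan
toy = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [nan, nan, 3.0, 4.0],  # 50% missing: above the threshold, dropped
    [1.0, nan, 3.0, 4.0],  # 25% missing: below the threshold, kept
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 4.0],
])

pruned, keep_rows, keep_cols = prune_by_missingness(toy)
print(pruned.shape)
print(keep_rows.tolist())
```

Returning the boolean keep masks alongside the pruned matrix makes it possible to report which genes and samples were removed.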

jaclyn-taroni commented 6 years ago

A quick google of "knn impute python" gets me: https://github.com/iskandr/knnimpute

cgreene commented 6 years ago

There are a lot of ways to do knn impute poorly, but that one looks good at first glance.

jaclyn-taroni commented 6 years ago

To summarize, for species-level compendia:

jaclyn-taroni commented 6 years ago

I'm going to change the title to better reflect this.

jaclyn-taroni commented 6 years ago

From Troyanskaya, et al. 2001:

KNNimpute algorithm

The KNN-based method selects genes with expression profiles similar to the gene of interest to impute missing values. If we consider gene A that has one missing value in experiment 1, this method would find K other genes, which have a value present in experiment 1, with expression most similar to A in experiments 2–N (where N is the total number of experiments). A weighted average of values in experiment 1 from the K closest genes is then used as an estimate for the missing value in gene A. In the weighted average, the contribution of each gene is weighted by similarity of its expression to that of gene A.
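The description above can be sketched in a few lines of NumPy. This is a minimal illustration, not the knnimpute package's implementation and not necessarily the exact procedure from the paper: the function name `knn_impute`, the root-mean-square distance over shared experiments, and the inverse-distance weights are assumptions chosen to match the prose.

```python
import numpy as np

def knn_impute(matrix, k=2):
    """Fill NaNs in a genes-by-experiments matrix from the K most similar genes."""
    data = np.asarray(matrix, dtype=float)
    filled = data.copy()
    for g, e in zip(*np.where(np.isnan(data))):
        observed = ~np.isnan(data[g])  # experiments where gene g has a value
        candidates = []
        for other in range(data.shape[0]):
            # A neighbor must have a value present in the experiment being imputed.
            if other == g or np.isnan(data[other, e]):
                continue
            shared = observed & ~np.isnan(data[other])
            if not shared.any():
                continue
            diff = data[g, shared] - data[other, shared]
            dist = np.sqrt(np.mean(diff ** 2))
            candidates.append((dist, data[other, e]))
        if not candidates:
            continue  # no usable neighbor: leave the value missing
        candidates.sort(key=lambda pair: pair[0])
        dists = np.array([d for d, _ in candidates[:k]])
        values = np.array([v for _, v in candidates[:k]])
        # Assumed weighting: closer genes contribute more (inverse distance).
        weights = 1.0 / (dists + 1e-6)
        filled[g, e] = np.sum(weights * values) / np.sum(weights)
    return filled

nan = np.nan
toy = np.array([
    [1.0, 2.0, nan],  # gene A: missing in the last experiment
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 3.0],
    [9.0, 9.0, 9.0],  # dissimilar gene, not among the 2 nearest
])
print(knn_impute(toy, k=2))
```

The double loop is quadratic in the number of genes, so this form is only suitable for small-scale experiments.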

jaclyn-taroni commented 6 years ago

Writing some more detailed notes about how this might look:

[Image: photographed notes, Office Lens capture "10_7_18 12_54 pm"]

Some more notes / details

I think some small scale experiments are in order as some of this is based on intuition and/or experience with a single platform.

jaclyn-taroni commented 6 years ago

What I'll need to do small scale experiments:

I can come up with a list of sample accession codes for the last point.

jaclyn-taroni commented 6 years ago

GEO microarray samples from three Affymetrix platforms (zebrafish, zebgene10st, and zebgene11st): zf_samples_three_affy_platforms.txt

jaclyn-taroni commented 6 years ago

Notes: 1) For full outer joins (union of all genes), missing values should be filled with NA. 2) We should keep track of where the imputed values are in the final matrix, as this might be something users want to know.
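One way to keep track of imputed positions, per note 2, is to record a boolean mask before filling anything in. The sketch below assumes NumPy and uses a row-mean fill purely as a stand-in for the real imputation step; the variable names and values are hypothetical.

```python
import numpy as np

nan = np.nan
# Hypothetical genes-by-samples matrix after a full outer join (NaN = not measured).
expression = np.array([
    [1.0, nan, 3.0],
    [4.0, 5.0, nan],
])

# Record where values are missing BEFORE imputing, so users can be told later.
imputed_mask = np.isnan(expression)

# Stand-in imputation (row mean), used here only to keep the example short.
row_means = np.nanmean(expression, axis=1, keepdims=True)
filled = np.where(imputed_mask, row_means, expression)

print(filled)
print(imputed_mask)
```

Shipping the mask alongside the final matrix lets users exclude or down-weight imputed entries in their own analyses.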

Within a species:

I would like this checklist to be included in the methods section of the PR where it is implemented.