AlexsLemonade / compendium-processing

A series of analyses related to refine.bio species compendia
BSD 3-Clause "New" or "Revised" License

Construct "masked" zebrafish matrix for use as gold standard for imputation challenge #10

Open jaclyn-taroni opened 5 years ago

jaclyn-taroni commented 5 years ago

To identify new and/or improved methods for imputing missing values in our species compendia, we need to put together data that allows us to evaluate performance. This requires at least two matrices: 1) the complete matrix that contains the true values and 2) a matrix where some of the values have been replaced with NAs ("masked").

We want every missing value in the masked matrix to have a true value associated with it, i.e., no values in the complete matrix should be missing. This very likely means subsetting to only the genes that are on the zebrafish Affymetrix microarray.
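As a rough illustration of the complete/masked pair described above, here is a minimal Python sketch. It assumes the complete matrix is a NumPy array with no missing values, and it masks entries uniformly at random. The function name `mask_matrix`, the `fraction` and `seed` parameters, and the uniform masking scheme are all placeholders for illustration, not the masking design this issue settles on.

```python
import numpy as np


def mask_matrix(complete, fraction=0.1, seed=0):
    """Return a copy of `complete` with a random fraction of entries set to NaN.

    Hypothetical helper: uniform random masking is an assumption made for
    this sketch, not the scheme specified in the issue.
    """
    complete = np.asarray(complete, dtype=float)
    # The gold standard must have a true value for every entry.
    if np.isnan(complete).any():
        raise ValueError("complete matrix must not contain missing values")

    rng = np.random.default_rng(seed)
    masked = complete.copy()
    n_mask = int(round(fraction * masked.size))
    # Pick distinct flat positions to mask so exactly n_mask entries become NaN.
    flat_idx = rng.choice(masked.size, size=n_mask, replace=False)
    masked.flat[flat_idx] = np.nan
    return masked


# Toy stand-in for a genes x samples expression matrix.
complete = np.arange(20, dtype=float).reshape(4, 5)
masked = mask_matrix(complete, fraction=0.25, seed=1)
```

Because the complete matrix is left untouched, an imputation method can later be scored by comparing its output to `complete` at exactly the positions where `masked` is NaN.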

Here are the features of the masked matrix that we want:

jaclyn-taroni commented 5 years ago

We'll want to do everything but the QN step (see: https://github.com/AlexsLemonade/refinebio/issues/508#issuecomment-435879283), so that includes the log2(x+1) transformation of the RNA-seq data.
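For reference, the log2(x+1) transformation mentioned above can be sketched in Python as follows. This assumes the RNA-seq values are non-negative and held in a NumPy array; the function name `log2_plus_one` is hypothetical, and the other processing steps from the linked comment are not reproduced here.

```python
import numpy as np


def log2_plus_one(values):
    """Apply log2(x + 1) elementwise; assumes non-negative input values."""
    values = np.asarray(values, dtype=float)
    return np.log2(values + 1.0)


# Toy RNA-seq values (illustrative only).
transformed = log2_plus_one([0.0, 1.0, 3.0, 7.0])
```

Adding 1 before taking the logarithm keeps zero-valued entries finite, since they map to 0.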

jaclyn-taroni commented 5 years ago

WIP here: https://github.com/AlexsLemonade/compendium-processing/blob/jaclyn-taroni/masked/masked_gold_standard