To identify new and/or improved methods for imputing missing values in our species compendia, we need to put together data that allows us to evaluate performance. This requires at least two matrices: 1) the complete matrix that contains the true values and 2) a matrix where some of the values have been replaced with NAs ("masked").
We want every masked value to have a true value associated with it -- i.e., no values in the complete matrix should be missing. This very likely means subsetting to only those genes that are on the zebrafish Affymetrix microarray.
Here are the features of the masked matrix that we want:
Some values to be missing completely at random (MCAR)
Some values that are missing for some non-negligible number of samples ("missing rows at random"; see #5)
Some values missing for all microarray samples -- this represents the real challenge of measuring genes in RNA-seq that were not on the chips for the legacy data
Nice to have: masked values in RNA-seq data that reflect what we've observed -- shorter genes and genes with low expression values are more likely to be zero
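As a rough illustration of the first three patterns, here is a minimal sketch in Python with NumPy. Everything in it is an assumption for discussion rather than an agreed design: the function name `make_masked_matrix`, the genes x samples orientation, the boolean `is_microarray` vector, and all of the default fractions are hypothetical. It also reads "missing rows at random" as "a gene is masked in a random subset of samples", which may not match what #5 intends. The nice-to-have RNA-seq pattern (masking driven by gene length and expression level) is not covered.

```python
import numpy as np


def make_masked_matrix(complete, is_microarray, mcar_frac=0.05, row_frac=0.05,
                       sample_frac=0.3, array_gene_frac=0.1, seed=0):
    """Return a masked copy of `complete` (genes x samples) with NaN as NA.

    All parameter names and defaults are placeholders, not agreed values.
    """
    rng = np.random.default_rng(seed)
    masked = np.array(complete, dtype=float)  # copy; `complete` is untouched
    n_genes, n_samples = masked.shape

    # 1) MCAR: mask each entry independently with probability mcar_frac.
    masked[rng.random(masked.shape) < mcar_frac] = np.nan

    # 2) "Missing rows at random" (assumed meaning): pick some genes and mask
    #    each one in its own random subset of samples.
    n_row_genes = max(1, int(row_frac * n_genes))
    n_drop = max(1, int(sample_frac * n_samples))
    for gene in rng.choice(n_genes, size=n_row_genes, replace=False):
        cols = rng.choice(n_samples, size=n_drop, replace=False)
        masked[gene, cols] = np.nan

    # 3) Genes treated as absent from the chip: mask them in every
    #    microarray sample, leaving RNA-seq samples alone in this step.
    n_array_genes = max(1, int(array_gene_frac * n_genes))
    array_genes = rng.choice(n_genes, size=n_array_genes, replace=False)
    array_cols = np.flatnonzero(is_microarray)
    masked[np.ix_(array_genes, array_cols)] = np.nan

    return masked, array_genes


# Toy data standing in for the real compendium (purely synthetic).
toy_rng = np.random.default_rng(1)
complete = toy_rng.normal(size=(200, 40))
is_microarray = np.arange(40) < 25  # pretend the first 25 samples are microarray
masked, array_genes = make_masked_matrix(complete, is_microarray)
```

Because the complete matrix is left unmodified, any imputed value can be scored against its true value at exactly the positions where `masked` is NaN.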