KrishnaswamyLab / MAGIC

MAGIC (Markov Affinity-based Graph Imputation of Cells) is a method for imputing missing values and restoring the structure of large biological datasets.
GNU General Public License v2.0
341 stars, 97 forks

Getting bad results using different normalization #32

Closed shmohammadi86 closed 5 years ago

shmohammadi86 commented 7 years ago

Hi,

I am using MAGIC in comparison with a few other methods, and I get unsatisfactory results in all benchmarks, which makes me wonder if I am doing something wrong. I mask parts of the expression matrix, predict them using MAGIC, then compare the predictions with the known values that were masked. I log-transformed the expression data (log2(1+x), so all values are non-negative), then norm-infinity normalized the result (so the max value in each column is one). I used the default parameters (k = 30; ka = 10; npca = 20) and increased t from 6 up to 100 (6 is absolutely terrible; by 100 it gets more reasonable). Still, I get bad correlation/relative error between MAGIC's predictions and the true values that were masked out in cross-validation. One of the datasets I tried this on is https://support.10xgenomics.com/single-cell-gene-expression/datasets/pbmc4k. Oh, and since my normalization preserves positivity, I rescaled to 99 % (which improved the results partially, but not enough).
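For anyone trying to reproduce this kind of benchmark, here is a minimal sketch of the mask-and-compare step in NumPy. It is not the exact code used above: the helper names `mask_entries` and `masked_pearson`, the choice to mask only nonzero entries, the 10 % masking fraction, and the toy Poisson matrix are all assumptions made for illustration. The imputation call itself is left out; any method's output can be passed in as `X_pred`.

```python
import numpy as np

def mask_entries(X, frac=0.1, seed=0):
    """Zero out a random fraction of the nonzero entries of X.

    Returns the masked copy and a boolean array marking the hidden entries.
    Masking only nonzero entries is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    nonzero_idx = np.flatnonzero(X)                  # flat indices of observed values
    n_mask = max(1, int(round(frac * nonzero_idx.size)))
    chosen = rng.choice(nonzero_idx, size=n_mask, replace=False)
    mask = np.zeros(X.shape, dtype=bool)
    mask.flat[chosen] = True
    X_masked = X.copy()
    X_masked[mask] = 0.0                             # hide the held-out values
    return X_masked, mask

def masked_pearson(X_true, X_pred, mask):
    """Pearson correlation between true and predicted values on masked entries only."""
    return np.corrcoef(X_true[mask], X_pred[mask])[0, 1]

# Toy count matrix (hypothetical data, cells in rows and genes in columns).
X = np.random.default_rng(1).poisson(2.0, size=(50, 20)).astype(float)
X_masked, mask = mask_entries(X, frac=0.1)

# X_pred would be the output of the imputation method run on X_masked.
# As a sanity check, a perfect prediction scores a correlation of 1.
r_perfect = masked_pearson(X, X, mask)
```

Note that predictions and held-out values have to be on the same scale (same normalization and transform) before relative error means anything; correlation is less sensitive to that.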

xyl012 commented 7 years ago

Hi shmohammadi86, I am curious about what your research so far shows, as I am just starting to use MAGIC, but have not seen any papers comparing methods, and am also using 10x data. If you could recommend a method for imputation, I would be extremely grateful. Cheers

xyl012 commented 7 years ago

btw, it says they didn't transform the data so maybe that's part of the cause. If I were to recommend a sanity check, I would say to try before and after.

dvdijk commented 6 years ago

we L1 norm the data (library size norm) and then usually we do log (or sqrt) transform
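A rough sketch of that preprocessing in plain NumPy, for illustration only. The function name `library_size_normalize`, rescaling to the median library size, and the cells-in-rows layout are assumptions of this sketch, not necessarily what the MAGIC package does internally.

```python
import numpy as np

def library_size_normalize(counts, target=None):
    """Scale each cell (row) so that its counts sum to the same total (L1 norm)."""
    counts = np.asarray(counts, dtype=float)
    lib_size = counts.sum(axis=1, keepdims=True)     # total counts per cell
    if target is None:
        target = np.median(lib_size)                 # assumption: rescale to the median
    safe = np.where(lib_size == 0, 1.0, lib_size)    # avoid dividing by zero for empty cells
    return counts / safe * target

# Toy matrix: 3 cells (rows) x 3 genes (columns), library sizes 10, 20 and 4.
counts = np.array([[0, 2, 8],
                   [5, 0, 15],
                   [1, 1, 2]])

normed = library_size_normalize(counts)   # every row now sums to 10
sqrt_data = np.sqrt(normed)               # sqrt transform, no pseudocount needed
log_data = np.log1p(normed)               # log transform with a pseudocount of 1
```

Dividing by the row sum (rather than the row or column max) keeps the relative proportions of genes within a cell while removing differences in sequencing depth between cells.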

mfilip8 commented 5 years ago

> we L1 norm the data (library size norm) and then usually we do log (or sqrt) transform

Hi, great work! Nice, precise, detailed and clean paper.

Can I ask what criteria you use to choose between a log transform, a sqrt transform, or no transform at all? When would you consider the distribution of gene expression to be extreme?

Thanks a lot in advance

dvdijk commented 5 years ago

@shmohammadi86 I've never used norm-infinity normalization. We (and the field) generally normalize by the sum and not max.

Sqrt is convenient because it doesn't require a pseudocount (as in, e.g., log(x + 0.1)), since sqrt(0) = 0.

I would always use a log-type transform (sqrt included).
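A small illustration of the pseudocount point, using made-up values: with a log transform, where a zero lands depends on the pseudocount you pick, while sqrt maps zero to zero with no extra parameter.

```python
import numpy as np

# Hypothetical normalized expression values, including a zero.
x = np.array([0.0, 1.0, 10.0, 100.0])

log_small = np.log(x + 0.1)   # pseudocount 0.1: the zero becomes log(0.1), a negative value
log_one = np.log1p(x)         # pseudocount 1: the zero stays at 0
sqrt_x = np.sqrt(x)           # no pseudocount: the zero stays at 0

# Gap between the zero and the value 1 under each transform.
gap_log_small = log_small[1] - log_small[0]
gap_log_one = log_one[1] - log_one[0]
gap_sqrt = sqrt_x[1] - sqrt_x[0]
```

The gap between 0 and 1 is much larger with the 0.1 pseudocount than with a pseudocount of 1, so the choice of pseudocount changes how far zeros sit from low nonzero values.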