KrishnaswamyLab / MAGIC

MAGIC (Markov Affinity-based Graph Imputation of Cells) is a method for imputing missing values and restoring the structure of large biological datasets.
GNU General Public License v2.0

Recommended normalization methods #31

Closed b-dawes closed 6 years ago

b-dawes commented 7 years ago

Hello,

I've been trying out MAGIC on my lab's drop-seq data. I've found that the results vary between different normalization methods. I just wanted to check and see if you've tested different methods and have any recommendations.

The two methods I've tried are simple library size normalization, and library size normalization followed by a log transformation. Looking at the results, using log-transformed values seems to give nicer results (although I'm not sure how biologically valid it is). With the log-transformed data, expression of marker genes is boosted but stays localized to the cell clusters we had already identified. Without log-transforming the data, the gene expression looks smeared across the entire dataset, with most cells expressing a little bit of most genes.

I just wanted to check and see if this matches your experiences and if you have any recommendations? I also haven't tried playing around with any parameters, so maybe changing k or the number of diffusion steps could help?

Thanks, Brian

dvdijk commented 7 years ago

Hi Brian,

Log transformation is fairly standard practice. I have found that some datasets benefit more from it than others. For example, in my EMT data I found it wasn't necessary, so I just did library size normalization. A log transform (e.g. data = log(data + 0.1)) will boost the expression of lowly expressed genes relative to the highly expressed ones, and since the most interesting genes (like TFs) are often lowly expressed, this makes sense. I have found, though, that the log transform can increase the library size bias, even after library size normalization (preceding the log transform, as you describe). This makes sense because lowly sampled cells will have many zero counts, but log(x + 0.1) will give each of those zeros a value. As a result, lowly sampled cells will get a bias from the log transform. However, I have found that some datasets (I guess they have more exponentially distributed values for whatever reason) really benefit from the log transform. So whether to log transform or not really depends on the dataset.
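The two preprocessing options discussed here can be sketched in a few lines of NumPy. This is only an illustration, not MAGIC's own preprocessing code: the helper names `library_size_normalize` and `log_transform` are hypothetical, rescaling to the median library size is an assumption, and the toy count matrix (cells in rows, genes in columns) is made up. Only the pseudocount of 0.1 comes from the comment above.

```python
import numpy as np

def library_size_normalize(counts):
    """Scale each cell (row) so that its counts sum to the median library size.

    Assumes every cell has at least one molecule; empty cells would divide by zero.
    Rescaling to the median is an assumption made for this sketch.
    """
    counts = np.asarray(counts, dtype=float)
    lib_size = counts.sum(axis=1, keepdims=True)
    return counts / lib_size * np.median(lib_size)

def log_transform(data, pseudocount=0.1):
    """Apply log(data + pseudocount); zeros map to log(pseudocount), not to zero."""
    return np.log(np.asarray(data, dtype=float) + pseudocount)

# Toy matrix: the second cell is sampled ten times more shallowly than the others.
counts = np.array([
    [10, 0, 5, 85],
    [1, 0, 0, 9],
    [20, 5, 0, 75],
])

normalized = library_size_normalize(counts)
logged = log_transform(normalized)
```

After normalization every cell has the same total, but the shallow cell still has more zeros, and each of them becomes log(0.1) after the transform. That is the library size bias described above.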

ka should always be small (e.g. 4 or 10), and k should be 3 times ka.

t is something you could play with. I'd start at around t=6 and slowly increase it to see if you get better results.
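To get a feel for what increasing t does, here is a toy diffusion sketch. It is not the MAGIC implementation: it swaps in a plain fixed-bandwidth Gaussian kernel instead of MAGIC's adaptive k-nearest-neighbor kernel (so k and ka do not appear), and the names `markov_matrix`, `diffuse`, and `sigma` are hypothetical. It only illustrates the general idea of smoothing data with a row-normalized affinity matrix raised to the power t.

```python
import numpy as np

def markov_matrix(data, sigma=1.0):
    """Row-normalized Gaussian affinity matrix between cells (rows of data)."""
    data = np.asarray(data, dtype=float)
    # Pairwise squared Euclidean distances between all cells.
    sq_dists = ((data[:, None, :] - data[None, :, :]) ** 2).sum(axis=2)
    affinity = np.exp(-sq_dists / (2 * sigma ** 2))
    # Each row sums to 1, so each output cell is a weighted average of cells.
    return affinity / affinity.sum(axis=1, keepdims=True)

def diffuse(data, t, sigma=1.0):
    """Smooth data by applying the Markov matrix t times."""
    M = markov_matrix(data, sigma)
    return np.linalg.matrix_power(M, t) @ np.asarray(data, dtype=float)

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 5))  # made-up data: 50 cells, 5 genes

for t in (1, 3, 6, 9):
    smoothed = diffuse(data, t)
    # Mean per-gene variance across cells, as a rough measure of smoothing.
    print(t, smoothed.var(axis=0).mean())
```

Because every diffusion step replaces each cell with a weighted average of cells, larger t gives smoother data, and at very large t all cells converge toward the same values. That is one reason to increase t gradually rather than jump to a large value.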

Also make sure you do proper cell filtering. I have gotten much better results after removing cells that have a small library size. E.g. remove any cell with fewer than 1000 molecules (though again this threshold depends on the dataset).
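A minimal sketch of that filtering step, assuming a raw count matrix with cells in rows and genes in columns. The helper name `filter_cells_by_library_size` and the toy counts are hypothetical; the threshold of 1000 molecules is the example value mentioned above and should be tuned per dataset.

```python
import numpy as np

def filter_cells_by_library_size(counts, min_molecules=1000):
    """Drop cells whose total molecule count is below min_molecules.

    Returns the filtered matrix and the boolean mask of kept cells, so the
    same mask can be applied to cell barcodes or other metadata.
    """
    counts = np.asarray(counts)
    lib_size = counts.sum(axis=1)
    keep = lib_size >= min_molecules
    return counts[keep], keep

# Toy raw counts: library sizes are 1200, 35 and 1000.
counts = np.array([
    [600, 500, 100],
    [10, 20, 5],
    [400, 300, 300],
])

filtered, keep = filter_cells_by_library_size(counts)
```

Filtering should be done on raw counts, before library size normalization, since normalization makes all library sizes equal.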

David

b-dawes commented 7 years ago

Thanks for the quick response! I mostly just wanted to make sure there was nothing explicitly wrong with using log-transformed data in MAGIC. It sounds like we'll be sticking with logged data, and I'll follow your advice about changing the parameters.

Brian

dvdijk commented 7 years ago

Just make sure that if you use logged data, you're not doing the rescaling step.