Rmagic: Detected zero distance between samples

KrishnaswamyLab / MAGIC

MAGIC (Markov Affinity-based Graph Imputation of Cells) is a method for imputing missing values and restoring the structure of large biological datasets.

GNU General Public License v2.0

Rmagic: Detected zero distance between samples #133

Closed igordot closed 6 years ago

igordot commented 6 years ago

I just installed Rmagic (via CRAN) and magic-impute (via pip). The provided example generates a few warnings:

> MAGIC_data <- magic(magic_testdata, genes=c("VIM", "CDH1", "ZEB1"))
Calculating MAGIC...
/path/python3.6/site-packages/magic/magic.py:352: UserWarning: Input matrix contains unexpressed genes. Please remove them prior to running MAGIC.
  warnings.warn("Input matrix contains unexpressed genes. "
Calculating graph and diffusion operator...
Calculating PCA...
Calculated PCA in 3.96 seconds.
Calculating KNN search...
/path/python3.6/site-packages/graphtools/graphs.py:247: RuntimeWarning: Detected zero distance between samples 0 and 74, 26 and 109, 60 and 389, 68 and 167, 118 and 461, 124 and 434, 218 and 427, 296 and 345, 372 and 387, 374 and 417. Consider removing duplicates to avoid errors in downstream processing.
  RuntimeWarning)
Calculated KNN search in 0.11 seconds.

Are those expected? If these can be ignored for test data, how do I treat them with real data?

scottgigante commented 6 years ago

Hi @igordot ,

These aren't particularly problematic, but they are worth keeping in mind when you analyze your data.

In gene expression data, if we have a column containing only zeroes, then we definitely can't gain anything by analyzing it. The best thing to do is simply to remove it (with, for example, scprep.filter.remove_empty_genes(data)).
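To make the idea concrete, here is a minimal NumPy sketch of dropping all-zero gene columns. It is not scprep's implementation; the helper name `remove_empty_columns` and the toy matrix are made up for illustration, and it assumes cells are rows and genes are columns.

```python
import numpy as np

def remove_empty_columns(data):
    """Drop columns (genes) that are zero in every row (cell).

    Hypothetical helper for illustration; not part of scprep or MAGIC.
    """
    data = np.asarray(data)
    # A gene is kept if at least one cell has a nonzero value for it.
    keep = (data != 0).any(axis=0)
    return data[:, keep], keep

# Toy cells-by-genes matrix: the middle gene is never expressed.
counts = np.array([
    [3, 0, 1],
    [0, 0, 2],
    [5, 0, 0],
])
filtered, kept = remove_empty_columns(counts)
print(filtered.shape)  # (3, 2)
```

Returning the boolean mask alongside the filtered matrix makes it easy to subset the gene names with the same mask.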

If you're dealing with single-cell RNA sequencing, it's unlikely (but possible) that you have cells that are exact duplicates. MAGIC can handle duplicates, but their presence is sometimes indicative of poor quality data. If you have duplicates, it's likely because you have cells with extremely low library sizes (combinatorially, exact duplicates are extremely unlikely among cells with larger library sizes). You should remove these low-library-size cells, as they will introduce artifacts into your dataset. You can do this with scprep.filter.filter_library_size.
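If you want to check whether the duplicates really are low-library-size cells before filtering, something like the sketch below may help. It looks for exactly identical rows in the raw matrix, which is an assumption on my part: the warning itself is raised on distances computed after PCA. The helper name `find_duplicate_rows` and the toy data are hypothetical.

```python
import numpy as np

def find_duplicate_rows(data):
    """Group the indices of rows (cells) that are exactly identical.

    Hypothetical helper for illustration; rows are compared byte for byte,
    so this is meant for count matrices rather than noisy floats.
    """
    data = np.asarray(data)
    groups = {}
    for i, row in enumerate(data):
        groups.setdefault(row.tobytes(), []).append(i)
    # Only keep groups that contain more than one cell.
    return [idx for idx in groups.values() if len(idx) > 1]

# Toy cells-by-genes matrix: cells 0 and 2 are identical and nearly empty.
counts = np.array([
    [1, 0, 0, 0],
    [4, 2, 0, 7],
    [1, 0, 0, 0],
    [0, 3, 5, 1],
])
dupes = find_duplicate_rows(counts)
library_size = counts.sum(axis=1)  # total counts per cell

for group in dupes:
    print("duplicate cells:", group, "library sizes:", library_size[group])
```

If the duplicated cells all sit at the very low end of the library size distribution, filtering by library size should remove them along with the warning.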

igordot commented 6 years ago

Thanks for clarifying! That makes sense. I would not have necessarily expected that from a fairly small example dataset, but it's certainly possible.

Will the filtering steps eventually get R wrappers?

scottgigante commented 6 years ago

Ah, of course, you did mention you're using the R wrapper. I'll admit it's not something I've really given thought to. I built scprep because I found the Python infrastructure for loading and preprocessing scRNAseq data lacking; I assumed (perhaps wrongly) that this space was more or less already filled in R. If enough people are interested, I would consider porting scprep to R as well.

igordot commented 6 years ago

Sure, there are other ways to do it. Personally, I was planning on running MAGIC on pre-processed matrices, so those particular issues should theoretically not come up. I was just concerned that some parts are Python-only in case there are other errors I need to troubleshoot later.

scottgigante commented 6 years ago

If the matrices are preprocessed, it should probably not come up. Let me know if you run into any hurdles, but the MAGIC algorithm itself is fully ported to R - scprep is really just a bunch of matrix helper functions.

scottgigante commented 6 years ago

I'm going to go ahead and close this issue, but for anyone who stumbles across it and wonders how to handle these warnings in R:

Removing empty genes

# keep genes (columns) with at least one nonzero count; assumes cells are rows
data <- data[, colSums(data) > 0]

Library size filtering

# keep cells (rows) whose total counts exceed the cutoff
data <- data[rowSums(data) > 2000, ]
# 2000 is a good guess, but you should look at the histogram of library sizes and cut below the mode
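For Python users, a rough NumPy sketch of the "look at the histogram and cut below the mode" advice follows. The log-scale histogram, the bin count, the helper names, and the toy data are all assumptions made for illustration; the cutoff of 2000 is the guess from above, and in practice you would plot the histogram and pick the cutoff by eye.

```python
import numpy as np

def library_size_mode(data, bins=20):
    """Estimate the mode of per-cell library sizes from a log10 histogram.

    Hypothetical helper for illustration; not part of scprep or MAGIC.
    """
    library_size = np.asarray(data).sum(axis=1)
    log_size = np.log10(library_size[library_size > 0])
    hist, edges = np.histogram(log_size, bins=bins)
    peak = np.argmax(hist)
    # Midpoint of the fullest bin, converted back to the original scale.
    return 10 ** ((edges[peak] + edges[peak + 1]) / 2)

def keep_cells_above(data, cutoff):
    """Keep rows (cells) whose total counts exceed the cutoff."""
    data = np.asarray(data)
    return data[data.sum(axis=1) > cutoff]

# Toy data: three nearly empty cells and ten cells of ordinary depth.
sizes = np.array([100, 150, 200, 4000] + [5000] * 8 + [6000])
data = np.column_stack([sizes // 2, sizes - sizes // 2])

mode = library_size_mode(data)
filtered = keep_cells_above(data, cutoff=2000)
print("approximate mode:", round(mode))
print("cells kept:", filtered.shape[0], "of", data.shape[0])
```

The mode is only a sanity check here: a sensible cutoff sits well below it, so that the main population of cells survives and only the low-count tail is removed.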