igordot closed this issue 6 years ago
Hi @igordot ,
These aren't particularly problematic, but probably worth keeping in mind when you analyze your data.
In gene expression data, a column containing only zeroes carries no information, so there is nothing to gain by analyzing it. The best thing to do is simply to remove it (with, for example, `scprep.filter.remove_empty_genes(data)`).
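To make the empty-column point concrete, here is a minimal plain-Python sketch of the idea. It is not how scprep implements it; the toy matrix and the helper name `remove_empty_columns` are made up for illustration, and it assumes cells are rows and genes are columns.

```python
# Toy counts matrix (hypothetical values): rows are cells, columns are genes.
counts = [
    [3, 0, 1],
    [0, 0, 2],
    [5, 0, 0],
]


def remove_empty_columns(matrix):
    """Drop every column whose entries are all zero (hypothetical helper)."""
    # zip(*matrix) iterates over columns; keep indices of columns with any nonzero value.
    keep = [j for j, column in enumerate(zip(*matrix)) if any(v != 0 for v in column)]
    return [[row[j] for j in keep] for row in matrix]


# The middle gene is zero in every cell, so only the first and last columns remain.
filtered = remove_empty_columns(counts)
```

On a real dataset you would use the library call mentioned above rather than nested lists; the sketch only shows what "empty gene" means.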
If you're dealing with single-cell RNA sequencing, it's unlikely (but possible) that you have cells that are exact duplicates. MAGIC can handle duplicates, but their presence is sometimes indicative of poor-quality data. If you have duplicates, it's likely because you have cells with extremely low library sizes (combinatorially, exact duplicates are extremely unlikely among cells with larger library sizes). You should remove these cells, as they will introduce artifacts into your dataset. You can do this with `scprep.filter.filter_library_size`.
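As a rough illustration of why duplicates and low library sizes tend to go together, here is a small plain-Python sketch. It assumes cells are rows; the toy values, the helper names, and the cutoff of 100 are all hypothetical and chosen only for this example, and it is not a reproduction of what scprep does internally.

```python
from collections import Counter

# Toy counts matrix (hypothetical values): rows are cells, columns are genes.
cells = [
    [1, 0, 0],     # nearly empty cell
    [1, 0, 0],     # exact duplicate of the first cell
    [40, 25, 60],
    [35, 30, 70],
]


def library_sizes(matrix):
    """Total count per cell (row sum)."""
    return [sum(row) for row in matrix]


def count_duplicate_rows(matrix):
    """Number of rows that exactly repeat another row."""
    seen = Counter(tuple(row) for row in matrix)
    return sum(n - 1 for n in seen.values())


def filter_by_library_size(matrix, cutoff):
    """Keep only cells whose library size is strictly greater than cutoff."""
    return [row for row in matrix if sum(row) > cutoff]


sizes = library_sizes(cells)
dups_before = count_duplicate_rows(cells)
kept = filter_by_library_size(cells, cutoff=100)  # cutoff is arbitrary for this toy data
dups_after = count_duplicate_rows(kept)
```

In this toy case the only duplicated cells are the two with a library size of 1, so filtering on library size removes the duplicates as a side effect.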
Thanks for clarifying! That makes sense. I would not have necessarily expected that from a fairly small example dataset, but it's certainly possible.
Will the filtering steps eventually get R wrappers?
Ah, of course, you did mention you're using the R wrapper. I'll admit it's not something I've really given thought to. I built `scprep` because I found the Python infrastructure for loading and preprocessing scRNAseq data lacking; I assumed (perhaps wrongly) that this space was more or less already filled in R. If enough people are interested, I would consider porting `scprep` to R as well.
Sure, there are other ways to do it. Personally, I was planning on running MAGIC on pre-processed matrices, so those particular issues should theoretically not come up. I was just concerned that some parts are Python-only, in case there are other errors I need to troubleshoot later.
If the matrices are preprocessed, these issues should probably not come up. Let me know if you run into any hurdles, but the MAGIC algorithm itself is fully ported to R; `scprep` is really just a bunch of matrix helper functions.
I'm going to go ahead and close this issue, but for anyone stumbling across this issue and wondering how to solve these in R:
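# Remove genes (columns) that are zero in every cell
data <- data[, colSums(data) > 0]
# Remove cells (rows) whose library size is 2000 or less
data <- data[rowSums(data) > 2000, ]
# 2000 is a good guess but you should look at the histogram of library sizes and cut below the mode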
I just installed Rmagic (via CRAN) and magic-impute (via pip). The provided example generates a few warnings:
Are those expected? If these can be ignored for test data, how do I treat them with real data?