JonathanShor / DoubletDetection

Doublet detection in single-cell RNA-seq data.
https://doubletdetection.readthedocs.io/en/stable/
MIT License

scanpy normalization #108

Closed adamgayoso closed 5 years ago

adamgayoso commented 6 years ago

Thanks to @yueqiw for trying this. We could add this as an alternative normalization procedure. What do you think @JonathanShor @ambrosejcarr?

```python
import scanpy.api as sc
from scipy.sparse import issparse

def scanpy_normalizer(count_data):
    """Normalize raw counts using the scanpy preprocessing pipeline."""
    adata = sc.AnnData(X=count_data)
    # Store total counts per cell for the regression step below.
    if issparse(adata.X):
        adata.obs['n_counts'] = adata.X.sum(axis=1).A1
    else:
        adata.obs['n_counts'] = adata.X.sum(axis=1)
    # Library-size normalize each cell to 10,000 counts.
    sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)
    # Keep only highly variable genes.
    filter_result = sc.pp.filter_genes_dispersion(
        adata.X, min_mean=0.02, max_mean=3, min_disp=0.8)
    adata = adata[:, filter_result.gene_subset]
    # Log-transform, regress out total counts, and scale.
    sc.pp.log1p(adata)
    sc.pp.regress_out(adata, ['n_counts'], n_jobs=8)
    sc.pp.scale(adata, max_value=10)
    return adata.X
```
adamgayoso commented 6 years ago

Check whether gene filtering could be done once on raw_counts, with the gene indices stored for later use.

JonathanShor commented 6 years ago

I suppose this would go in plot.py with normalize_counts?

We should break both out into a utils.py, or even a normalizers.py.

@yueqiw would you be interested in creating a PR with your code?

ambrosejcarr commented 6 years ago

I can comment if you summarize at a high level what this buys us.

adamgayoso commented 6 years ago

@ambrosejcarr This provides users an easier way to use scanpy preprocessing. That said, since we already allow a custom normalizer function, there may be no need to explicitly add this to our code. Do you think it's worth adding given its popularity?
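To make the custom-normalizer hook concrete: a normalizer is just a callable mapping a raw count matrix to a normalized one. Here is a minimal numpy-only sketch (library-size normalization to 10,000 counts plus log1p, mirroring the first steps of the scanpy pipeline above; `simple_normalizer` is an illustrative name, not part of the package):

```python
import numpy as np

def simple_normalizer(count_data):
    # Scale each cell so its counts sum to 10,000, then log-transform.
    # A minimal stand-in for the full scanpy pipeline: no gene filtering,
    # regression, or scaling is performed here.
    counts_per_cell = count_data.sum(axis=1, keepdims=True)
    normalized = count_data / counts_per_cell * 1e4
    return np.log1p(normalized)
```

Any such callable could be supplied wherever DoubletDetection accepts a custom normalizer; the `scanpy_normalizer` above plugs in the same way.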

ambrosejcarr commented 6 years ago

Do you think it's worth adding given its popularity?

If you already have the code, I don't think it would hurt.

yueqiw commented 6 years ago

I'd be interested but probably won't have time in the next week or so... I can provide my thoughts on using this normalization method.

Parameters and robustness. Since there are quite a few parameters involved (in the filter_genes_dispersion and regress_out functions), we need to decide which keyword arguments to expose, and users would need to choose parameters for their dataset. I chose the parameters based on my actual analysis in Seurat, so users should be aware that they need to go through the Seurat or Scanpy pipeline to see what parameters make sense. I believe the method is quite robust and should work easily in most cases, but I've only tried it on a few datasets.

Flexibility. I really like the way custom normalizer functions can be easily plugged in. So alternatively, the scanpy normalization could be described in the tutorial as an example of a custom normalizer function.

adamgayoso commented 6 years ago

I agree with @yueqiw on showing how it can be used in a tutorial. I would alter the code slightly so that filter_result.gene_subset is calculated only on raw, non-augmented counts, for an additional speedup.
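The compute-once idea could look roughly like this numpy-only sketch: the gene subset is computed a single time on the raw counts, and the stored column indices are reused to slice any downstream matrix. `select_variable_genes` is a hypothetical dispersion-style filter for illustration, not the scanpy implementation:

```python
import numpy as np

def select_variable_genes(raw_counts, top_n=2):
    # Compute the gene subset ONCE on the raw counts and return the
    # column indices, so repeated iterations can reuse them instead of
    # re-running dispersion filtering on every augmented matrix.
    dispersion = raw_counts.var(axis=0) / (raw_counts.mean(axis=0) + 1e-12)
    return np.argsort(dispersion)[::-1][:top_n]

# Usage: compute indices once, then slice any later matrix with them.
raw = np.array([[1., 10., 1.],
                [2., 0., 1.],
                [1., 20., 1.]])
gene_idx = select_variable_genes(raw, top_n=2)
filtered = raw[:, gene_idx]
```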

adamgayoso commented 5 years ago

Closing this due to inactivity.