eleozzr / desc

Deep Embedding for Single-cell Clustering
https://eleozzr.github.io/desc/

How to remove batch effect? #9

Open zhumengyan opened 5 years ago

zhumengyan commented 5 years ago

Hi, eleozzr! Thanks for your great tool. What attracts me most is its batch effect removal. I have two 10X scRNA-Seq datasets that I want to combine and then cluster. So I have two questions:

  1. Can I use DESC to remove the batch effect and then cluster?
  2. If DESC can handle this situation, could you provide some simple code? I can't find related information in your tutorial.

Thanks, Mengyan Zhu

eleozzr commented 5 years ago

Hi Mengyan, thanks for your question. DESC can remove batch effects iteratively during training. The first step is to combine the two 10X scRNA-Seq datasets into a single `AnnData` object.

# Suppose you already have a preprocessed `AnnData` object (normalized, highly variable genes selected, etc.)
import scanpy as sc  # `scanpy.api` is deprecated in newer scanpy releases
import desc
# Scale the data within each batch; "Group" is the column in adata.obs that holds the batch label
adata=desc.scale_bygroup(adata, groupby="Group")
save_dir="tmp_results"
adata=desc.train(adata,
        dims=[adata.shape[1],128,32],
        n_neighbors=10,
        tol=0.001,
        batch_size=256,
        louvain_resolution=[0.5,0.6],
        save_dir=save_dir,
        do_tsne=True,
        use_GPU=False,
        num_Cores=1,
        num_Cores_tsne=5,
        save_encoder_weights=False,
        use_ae_weights=False,
        do_umap=False, # set to True to compute UMAP as well
        learning_rate=400) # you can change the values of the other parameters
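
As a rough illustration of why the data are scaled within each batch before training, the sketch below standardizes one gene's expression separately per batch using only the Python standard library. It is not DESC's implementation: the helper name `scale_by_group` and the toy values are hypothetical, and it assumes the scaling step amounts to a per-batch z-score for each gene.

```python
from collections import defaultdict
from statistics import mean, pstdev


def scale_by_group(values, groups):
    """Z-score `values` separately within each group label (hypothetical helper)."""
    # Collect the indices of the cells that belong to each batch.
    idx_by_group = defaultdict(list)
    for i, g in enumerate(groups):
        idx_by_group[g].append(i)

    scaled = [0.0] * len(values)
    for idx in idx_by_group.values():
        batch_vals = [values[i] for i in idx]
        mu = mean(batch_vals)
        sd = pstdev(batch_vals)  # population standard deviation of this batch
        for i in idx:
            # A gene that is constant within a batch maps to 0 (avoids dividing by zero).
            scaled[i] = (values[i] - mu) / sd if sd > 0 else 0.0
    return scaled


# Toy data: one gene measured in two batches; batch "B" is shifted upward by 10.
expr = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
group = ["A", "A", "A", "B", "B", "B"]
scaled = scale_by_group(expr, group)
print(scaled)
```

After scaling, the constant offset between the two toy batches is gone, so cells with the same relative expression get the same value regardless of batch.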

After training, you get a new AnnData object, and you can check the t-SNE or UMAP plots.

sc.pl.scatter(adata,basis="tsne0.5",color=["desc_0.5","Group"]) # make sure "Group" is in your `adata.obs.columns`

Hope this helps you. Thanks.

tianyi21 commented 4 years ago

A follow-up question. If I have two datasets, the selected highly variable genes may not agree: the genes themselves can differ, and so can the number of HV genes. Computationally, I think this will cause a problem. Should I assume that different batches must have the same genes after preprocessing? Thanks!

eleozzr commented 4 years ago

There are two options: take the union of the two datasets' HVGs, or take their intersection. For the union, you should use an 'outer' join, which fills in 0 for genes that exist in only one dataset. You can try

import anndata
# adata_list is a list holding the AnnData objects to combine
adata=anndata.AnnData.concatenate(*adata_list,join="outer") #join="outer"

or in Seurat by using

obj=Seurat::merge(obj1,obj2)

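
The difference between the two options can be sketched without any single-cell library. The example below uses hypothetical gene names and a made-up dict layout (cell -> gene -> value) to show what the intersection keeps and how the union fills genes missing from one dataset with 0. The function `combine` is an illustrative helper, not part of anndata or DESC.

```python
def gene_set(dataset):
    """Return the set of all genes measured in a {cell: {gene: value}} dataset."""
    genes = set()
    for row in dataset.values():
        genes.update(row)
    return genes


def combine(ds1, ds2, how="outer"):
    """Combine two datasets on the union ("outer") or intersection ("inner") of genes."""
    g1, g2 = gene_set(ds1), gene_set(ds2)
    if how == "outer":
        genes = sorted(g1 | g2)
    elif how == "inner":
        genes = sorted(g1 & g2)
    else:
        raise ValueError("how must be 'outer' or 'inner'")

    combined = {}
    # Assumes cell names are unique across the two datasets.
    for dataset in (ds1, ds2):
        for cell, row in dataset.items():
            # Genes absent from this cell's dataset are filled with 0.
            combined[cell] = [row.get(gene, 0.0) for gene in genes]
    return genes, combined


# Toy batches with partially overlapping (hypothetical) HVG lists.
batch1 = {"cell1": {"GeneA": 1.0, "GeneB": 2.0}}
batch2 = {"cell2": {"GeneB": 3.0, "GeneC": 4.0}}

genes_outer, outer = combine(batch1, batch2, how="outer")
genes_inner, inner = combine(batch1, batch2, how="inner")
print(genes_outer, outer)
print(genes_inner, inner)
```

The union keeps every gene at the cost of introducing zeros, while the intersection keeps only genes measured in both datasets and may discard batch-specific HVGs.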
tianyi21 commented 4 years ago

Thanks Xiangjie!