Oshlack / splatter

Simple simulation of single-cell RNA sequencing data
http://oshlacklab.com/splatter/
GNU General Public License v3.0

No clear clusters in splat simulation #152

Closed XinyeZhao closed 2 years ago

XinyeZhao commented 2 years ago

Hi Splatter team,

When I use the PBMC3k data as the input for Splatter, it seems the model treats all genes as highly variable genes (HVGs). If I preprocess the simulated data with scanpy, select the top 2000 HVGs, run Leiden, and visualize with UMAP, the clusters are not well separated even when I set de.prob to 1. However, I noticed that the more genes I use for Leiden, the better the clusters separate. I just want to check whether this is because I am using Splatter incorrectly. The first figure uses 7000 HVGs and the second uses 2000 HVGs.

[image: 7000 HVGs]

[image: 2000 HVGs]

Thanks!

lazappi commented 2 years ago

Hi @XinyeZhao

Thanks for giving {splatter} a go. I'm not quite clear what your question is. Can you please provide the code you are using? The splat model doesn't have an explicit idea of a "highly variable gene" so I'm not quite sure what you are looking at.

XinyeZhao commented 2 years ago

Sure, I used Python with scanpy to process the data, and here is my code. It's just the standard preprocessing pipeline. If I run the same code on the real PBMC3k data, filtering to the top 2000 HVGs is enough to see clear clusters with Leiden, but data generated by Splatter requires about 10k HVGs to show the clusters:

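import anndata as ad
import scanpy as sc

# simul_data: the Splatter counts as a pandas DataFrame (assumed cells x genes)
adata = ad.AnnData(simul_data.values)

# Drop genes detected in fewer than 3 cells
sc.pp.filter_genes(adata, min_cells=3)

# Library-size normalization followed by log1p
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)

# Keep the top 2000 HVGs; scanpy ignores the mean/dispersion cutoffs when n_top_genes is set
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5, n_top_genes=2000)
adata = adata[:, adata.var.highly_variable].copy()  # copy so later steps do not run on a view

sc.tl.pca(adata, svd_solver='arpack', n_comps=30)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=1)
print(adata.X.shape)
sc.pl.umap(adata, color=['leiden'])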
lazappi commented 2 years ago

How clearly the clusters separate depends on the parameters passed to splatSimulate(). The estimation process does not set the parameters that create clusters, so those need to be supplied manually.
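
As a rough illustration of why the group parameters matter: in the splat model, group differences come from multiplicative differential expression factors drawn from a log-normal distribution, controlled by de.prob (the fraction of DE genes) together with de.facLoc and de.facScale (the location and scale of the log-normal). The sketch below is a toy stand-in written with the Python standard library, not Splatter's implementation; `simulate_de_factors` and `summarize` are hypothetical helper names, and the value `fac_loc=0.1` is assumed to be close to the package default (check `newSplatParams()` to confirm). It suggests that with a small location value most genes change by well under 2-fold even when every gene is DE, which would be consistent with needing many genes before clusters become visible.

```python
import math
import random
import statistics


def simulate_de_factors(n_genes, de_prob, fac_loc, fac_scale, seed=0):
    """Toy log-normal DE factors: 1.0 for non-DE genes (not Splatter's code)."""
    rng = random.Random(seed)
    factors = []
    for _ in range(n_genes):
        if rng.random() < de_prob:
            # DE gene: multiplicative factor from a log-normal distribution
            factors.append(rng.lognormvariate(fac_loc, fac_scale))
        else:
            # Non-DE gene: mean is unchanged between groups
            factors.append(1.0)
    return factors


def summarize(factors):
    """Summarize per-gene effect sizes as absolute log2 fold changes."""
    abs_lfc = [abs(math.log2(f)) for f in factors]
    return {
        "median_abs_log2fc": statistics.median(abs_lfc),
        "frac_at_least_2fold": sum(x >= 1.0 for x in abs_lfc) / len(abs_lfc),
    }


# Every gene is DE in both cases; only the location of the factors differs.
weak = summarize(simulate_de_factors(5000, de_prob=1.0, fac_loc=0.1, fac_scale=0.4))
strong = summarize(simulate_de_factors(5000, de_prob=1.0, fac_loc=1.0, fac_scale=0.4))
print("fac_loc=0.1:", weak)
print("fac_loc=1.0:", strong)
```

In Splatter itself, the corresponding step would be to simulate with groups and pass larger DE factor parameters to splatSimulate() rather than relying only on the estimated parameters.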

lazappi commented 2 years ago

@XinyeZhao Is this OK now, or do you have any follow-up questions?