broadinstitute / CellBender

CellBender is a software package for eliminating technical artifacts from high-throughput single-cell RNA sequencing (scRNA-seq) data.
https://cellbender.rtfd.io
BSD 3-Clause "New" or "Revised" License
279 stars 50 forks source link

How can I make UMAPs with and without(on the same set of cell barcodes) #232

Open ruiqiizhou opened 1 year ago

ruiqiizhou commented 1 year ago

I was trying to follow the tutorial: ‘Create some validation plots of various analyses with and without cellbender remove-background. One convenient way to do this is in scanpy and storing the raw count matrix and the background-removed count matrix as separate “layers”. UMAPs with and without (on the same set of cell barcodes) Marker gene dotplots and violin plots (you should see less background)’'

I am using anndata, I have raw anndata and after cellbender anndata. But I do not know how to 'storing the raw count matrix and the background-removed count matrix as separate “layers”. and plot UMAPs with and without (on the same set of cell barcodes)'

after hellbender, anndata like this : AnnData object with n_obs × n_vars = 1483 × 32227 obs: 'latent_RT_efficiency', 'latent_cell_probability', 'latent_scale' var: 'ambient_expression', 'feature_type', 'gene_id' uns: 'contamination_fraction_params', 'fraction_data_used_for_testing', 'lambda_multiplier', 'overdispersion_mean_and_scale', 'target_false_positive_rate', 'test_elbo', 'test_epoch', 'training_elbo_per_epoch' obsm: 'latent_gene_encoding'

This is my raw dataset:

AnnData object with n_obs × n_vars = 404099 × 56984 var: 'gene_ids', 'feature_types' layers: 'ambiguous', 'spliced', 'unspliced'

sjfleming commented 1 year ago

Ah, so using layers in anndata is a great trick, and it's easy to do. One way would be to use this function

https://github.com/broadinstitute/CellBender/blob/a3420fd7b722d1ff14ffe7c588ae2225a37f2763/cellbender/remove_background/downstream.py#L193

You can use this code from a Jupyter notebook by doing

from cellbender.remove_background.downstream import load_anndata_from_input_and_output

adata = load_anndata_from_input_and_output(input_file='raw.h5', output_file='cellbender_output.h5')

or you could manually do the same thing yourself, kind of like this:

adata_raw  # your raw data
adata  # cellbender output

adata.layers['cellbender'] = adata.X.copy()
adata.layers['raw'] = adata_raw[adata.obs_names].layers['spliced'].copy()

Then you can use the different layers to make different UMAPs like this

# preprocess raw data
adata.X = adata.layers['raw'].copy()
sc.tl.highly_variable_genes(adata, n_top_genes=2000, flavor='seurat_v3')
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.tl.pca(adata, n_comps=20, use_highly_variable=True)
sc.pp.neighbors(adata, use_rep='X_pca')

# UMAP of raw data
sc.tl.umap(adata)
adata.obsm['X_umap_raw'] = adata.obsm['X_umap'].copy()

# preprocess cellbender data
adata.X = adata.layers['cellbender'].copy()
sc.tl.highly_variable_genes(adata, n_top_genes=2000, flavor='seurat_v3')
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.tl.pca(adata, n_comps=20, use_highly_variable=True)
sc.pp.neighbors(adata, use_rep='X_pca')

# UMAP of raw data
sc.tl.umap(adata)
adata.obsm['X_umap_cellbender'] = adata.obsm['X_umap'].copy()

del adata.obsm['X_umap']

Then you can do things like

sc.pl.embedding(adata, basis='X_umap_raw')
sc.pl.embedding(adata, basis='X_umap_cellbender')