broadinstitute / CellBender

CellBender is a software package for eliminating technical artifacts from high-throughput single-cell RNA sequencing (scRNA-seq) data.
https://cellbender.rtfd.io
BSD 3-Clause "New" or "Revised" License
285 stars 52 forks source link

The raw counts calculated by cellbender is different from the raw counts of input file #286

Closed YiweiNiu closed 1 year ago

YiweiNiu commented 1 year ago

Hi,

I am testing cellbender v0.3.0 following the tutorial, but I find the n_raw reported by cellbender is different from the raw counts of the input. Here is the test

>>> import numpy as np
>>> import scanpy as sc
>>> from cellbender.remove_background.downstream import anndata_from_h5, load_anndata_from_input_and_output
>>> 
>>> # load the data
>>> adata = load_anndata_from_input_and_output(
...     input_file = 'tiny_raw_feature_bc_matrix.h5ad',
...     output_file = 'tiny_output.h5',
...     input_layer_key = 'raw',  # this will be the raw data layer
... )
Only 37760 barcodes were found.
This suggests the matrix was prefiltered.
CellBender requires a raw, unfiltered [Barcodes, Genes] matrix.
WARNING: Only 37760 barcodes in the input file. Ensure this is a raw (unfiltered) file with all barcodes, including the empty droplets.
>>> adata.var.loc['Ptma']
ambient_expression                  0.005005
features_analyzed_inds                     9
feature_type                 Gene Expression
genome                                  mm10
gene_id                   ENSMUSG00000026238
cellbender_analyzed                     True
n_raw                                 162136
n_cellbender                          155759
Name: Ptma, dtype: object
>>>
>>>
>>> adata_10x = sc.read("tiny_raw_feature_bc_matrix.h5ad")
>>> adata_10x.var["total_counts"] = np.ravel(adata_10x.X.sum(axis=0))
>>> adata_10x.var.loc['Ptma']
gene_id         ENSMUSG00000026238
genome                        mm10
feature_type       Gene Expression
total_counts                239060
Name: Ptma, dtype: object

The n_raw of gene Ptma reported by cellbender is 162136, while the one counted from the input is 239060. The np.ravel(X.sum(axis=0)) is from scanpy code.

I also checked this with R, and found the one calculated by scanpy is correct. Maybe this is a tiny thing. Or this metric is also used in other analysis besides the html report.

Best regards, Yiwei

sjfleming commented 1 year ago

Hi @YiweiNiu , this is a good observation! :)

But there is a reason for what you're seeing (I hope!).

The function load_anndata_from_input_and_output() by default loads only the droplets that were analyzed by cellbender. So the droplets past --total-droplets-included (or its default) are not included when the data is loaded.

I think that is the difference!

If you want load_anndata_from_input_and_output() to load all the droplets, you can use load_anndata_from_input_and_output(analyzed_barcodes_only=False) (https://cellbender.readthedocs.io/en/v0.3.0/reference/index.html#cellbender.remove_background.downstream.load_anndata_from_input_and_output). Then you should see that the total counts per gene agree with the raw data loaded by scanpy.

Let me know if this is not the case!

YiweiNiu commented 1 year ago

Thank you for your quick reply. Yes, it matches when using analyzed_barcodes_only=False parameter.