Closed mbabadi closed 5 years ago
These are parts of the remove-background
documentation that I stripped out from the main README:
Modules:
========
``remove-background``
---------------------
The command-line tool ``remove-background`` removes ambient RNA from a
count matrix generated by CellRanger ``count``. Entries in the count matrix
are the sum of real counts and some background counts from ambient RNA. The
purpose of the ``remove-background`` tool is two-fold: (1) to remove the
background RNA counts from real cells, and (2) to determine which barcodes
contain cells and which correspond to "empty" droplets with only ambient RNA.
The inference procedure also embeds gene expression in a lower-dimensional
latent space, which can be used for clustering and visualization.
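The additive picture described above (observed counts = real counts + ambient counts) can be made concrete with a toy simulation. This is only a sketch for intuition and is not the model that ``remove-background`` fits: the function names, the Poisson sampler, and all rates below are assumptions chosen for illustration.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson sample with Knuth's method (adequate for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_droplet(real_rates, ambient_rates, has_cell, rng):
    """Return (real, ambient, observed) per-gene counts for one droplet."""
    # An empty droplet contributes no real counts, only ambient RNA.
    real = [sample_poisson(r, rng) if has_cell else 0 for r in real_rates]
    ambient = [sample_poisson(a, rng) for a in ambient_rates]
    # Each observed entry is the sum of the real and the ambient count.
    observed = [r + a for r, a in zip(real, ambient)]
    return real, ambient, observed


rng = random.Random(0)
real_rates = [5.0, 0.5, 2.0]     # hypothetical per-gene expression rates
ambient_rates = [0.3, 0.1, 0.2]  # hypothetical ambient RNA rates

cell = simulate_droplet(real_rates, ambient_rates, has_cell=True, rng=rng)
empty = simulate_droplet(real_rates, ambient_rates, has_cell=False, rng=rng)
print("cell droplet  (real, ambient, observed):", cell)
print("empty droplet (real, ambient, observed):", empty)
```

In real data only the observed counts are available, so the tool has to infer the split into real and ambient parts.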
Input
A raw count matrix from single cell RNA sequencing,
including cell barcodes and empty droplet barcodes alike. This should be
the output of `CellRanger <https://support.10xgenomics.com/
single-cell-gene-expression/software/pipelines/
latest/what-is-cell-ranger>`_ ``count`` in the form of an `HDF5
file <https://support.10xgenomics.com/single-cell-gene-expression/
software/pipelines/latest/advanced/h5_matrices>`_.
Outputs
1. The primary output is a new count matrix containing the counts
that remain after background subtraction. The new count matrix
is also in HDF5 format, and can be used in place of the raw count
matrix in downstream analyses.
2. A second output is the probability that each barcode contains a real
cell. This information is stored in the output HDF5 file. For users
who would rather work with CSV files, there is also an output called
*cell_barcodes.csv* that lists each barcode which has been determined
to contain a cell.
3. A third output is a low-dimensional latent representation of the gene
expression of each cell. This information is also stored in the
output HDF5 file.
Cell probabilities can be used to filter out empty droplets
for downstream analyses. The low-dimensional latent
representation can be used for visualization and clustering.
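As a rough sketch of the filtering step described above, the snippet below keeps barcodes whose cell probability exceeds a cutoff and writes them one per row. The 0.5 threshold, the helper names, the example barcodes and probabilities, and the header-less one-column CSV layout are all assumptions for illustration; they are not taken from the tool itself.

```python
import csv
import io


def select_cell_barcodes(cell_probabilities, threshold=0.5):
    """Return barcodes whose cell probability exceeds the threshold, in input order."""
    return [bc for bc, p in cell_probabilities.items() if p > threshold]


def write_barcode_csv(barcodes, handle):
    """Write one barcode per row, with no header (an assumed layout)."""
    writer = csv.writer(handle)
    for bc in barcodes:
        writer.writerow([bc])


# Hypothetical values; in practice the probabilities would be read
# from the output HDF5 file.
cell_probabilities = {
    "AAACCTGAGAAACCAT-1": 0.99,
    "AAACCTGAGAAACCGC-1": 0.02,
    "AAACCTGAGAAACCTA-1": 0.87,
}

cells = select_cell_barcodes(cell_probabilities)
buffer = io.StringIO()  # stands in for an open file handle
write_barcode_csv(cells, buffer)
print(buffer.getvalue())
```

A stricter threshold trades recall of low-count cells for fewer empty droplets in downstream analyses, so the cutoff is worth checking per dataset.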
Methodology
Inferences made by ``remove-background`` are based on a Bayesian model and
amortized stochastic variational inference (SVI). During the training
phase, the model automatically learns the ambient RNA profile from the
data. It also uses an auto-encoder-like neural module to learn the
manifold of true (i.e. background-corrected) gene expression. The latter
helps us distinguish empty from cell-containing droplets. The model and
the inference algorithm are implemented using the probabilistic programming
language `Pyro <https://github.com/pyro-ppl/pyro/>`_.
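For intuition about how a Bayesian model can separate empty from cell-containing droplets, here is a deliberately simplified stand-in: a two-hypothesis Poisson model on total counts, evaluated with Bayes' rule. It is not the model used by ``remove-background``, which is far richer and is fit with SVI in Pyro; the function names, rates, and prior below are assumptions for illustration.

```python
import math


def log_poisson_pmf(count, rate):
    """Log of the Poisson probability mass function."""
    return count * math.log(rate) - rate - math.lgamma(count + 1)


def cell_posterior(total_counts, cell_rate, empty_rate, prior_cell=0.5):
    """Posterior probability that a droplet holds a cell, via Bayes' rule."""
    log_cell = math.log(prior_cell) + log_poisson_pmf(total_counts, cell_rate)
    log_empty = math.log(1.0 - prior_cell) + log_poisson_pmf(total_counts, empty_rate)
    # Subtract the larger log-weight before exponentiating to avoid overflow.
    top = max(log_cell, log_empty)
    weight_cell = math.exp(log_cell - top)
    weight_empty = math.exp(log_empty - top)
    return weight_cell / (weight_cell + weight_empty)


# Hypothetical rates: empty droplets average 50 counts, cells average 2000.
for counts in (40, 300, 1800):
    posterior = cell_posterior(counts, cell_rate=2000.0, empty_rate=50.0)
    print(counts, round(posterior, 3))
```

A model based on total counts alone struggles with small cells in a high-ambient background, which is one reason to also model the expression profile itself.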
Additional details about the method, including extensive evaluations,
can be found in the following pre-print:
S.J. Fleming, M. Babadi, et al. Unsupervised removal of background RNA
counts from scRNA-seq datasets. (2019) bioRxiv.
PR #20 introduces a basic structure for the CellBender documentation. I will keep this issue open to remind us to review and improve the documentation before the v0.1 release.
Adding in basic Sphinx to the codebase.