broadinstitute / CellBender

CellBender is a software package for eliminating technical artifacts from high-throughput single-cell RNA sequencing (scRNA-seq) data.
https://cellbender.rtfd.io
BSD 3-Clause "New" or "Revised" License

Basic readthedocs #9

Closed mbabadi closed 5 years ago

mbabadi commented 5 years ago

Adding in basic Sphinx to the codebase.

mbabadi commented 5 years ago

These are the parts of the remove-background documentation that I stripped out from the main readme:

Modules:
========

``remove-background``
---------------------

The command line tool called ``remove-background`` removes ambient RNA from a
count matrix generated by CellRanger ``count``.  Entries in the count matrix
are the sum of real counts and some background counts from ambient RNA.  The
purpose of the ``remove-background`` tool is two-fold: (1) to remove the
background RNA counts from real cells, and (2) to determine which barcodes
contain cells and which correspond to "empty" droplets with only ambient RNA.
The inference procedure also embeds gene expression in a lower-dimensional
latent space, which can be used for clustering and visualization.
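
As a rough illustration of the additive picture described above (observed
counts = real counts + ambient counts), the toy sketch below builds a tiny
count matrix by hand.  All numbers and variable names are hypothetical, and
the code does not reflect how ``remove-background`` actually infers the
background.

```python
# Toy illustration only: hypothetical numbers, not CellBender's algorithm.
# Rows are barcodes (droplets), columns are genes.

# True cell-derived counts; the last barcode is an "empty" droplet.
real_counts = [
    [10, 0, 5],   # barcode with a cell
    [0, 8, 2],    # barcode with a cell
    [0, 0, 0],    # empty droplet: no cell-derived counts
]

# Counts contributed by ambient RNA in each droplet.
ambient_counts = [
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
]

# What the raw count matrix records: the element-wise sum of the two.
observed_counts = [
    [r + a for r, a in zip(real_row, ambient_row)]
    for real_row, ambient_row in zip(real_counts, ambient_counts)
]

# A barcode contains a cell if any of its real counts is nonzero.
contains_cell = [any(count > 0 for count in row) for row in real_counts]
```

In real data only ``observed_counts`` is available; the two goals of the tool
correspond to estimating ``real_counts`` and ``contains_cell``.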

Input
    A raw count matrix from single cell RNA sequencing,
    including cell barcodes and empty droplet barcodes alike.  This should be
    the output of `CellRanger <https://support.10xgenomics.com/
    single-cell-gene-expression/software/pipelines/
    latest/what-is-cell-ranger>`_ ``count`` in the form of an `HDF5
    file <https://support.10xgenomics.com/single-cell-gene-expression/
    software/pipelines/latest/advanced/h5_matrices>`_.

Outputs
    1. The primary output is a new count matrix containing the counts that
       remain after background subtraction.  The new count matrix is also in
       HDF5 format, and can be used in place of the raw count matrix in
       downstream analyses.
    2. A second output is the probability that each barcode contains a real
       cell.  This information is contained in the HDF5 file.  For users
       who would rather work with CSV files, there is also an output called
       *cell_barcodes.csv* that lists each barcode which has been determined
       to contain a cell.
    3. A third output is a low-dimensional latent representation of the gene
       expression of each cell.  This information is contained in the
       HDF5 file.

    Cell probabilities can be used to filter out empty droplets
    for downstream analyses.  The low-dimensional latent
    representation can be used for visualization and clustering.
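
A minimal sketch of the filtering step mentioned above, using made-up
barcodes and probabilities.  The 0.5 cutoff, the variable names, and the
assumed layout of the CSV (one barcode per line, no header) are illustrative
assumptions, not documented behavior of ``remove-background``.

```python
import csv
import io

# Hypothetical per-barcode cell probabilities (in practice these would be
# read from the output HDF5 file, whose layout is not described here).
cell_probability = {
    "AAACCTGAGAAACCAT-1": 0.99,
    "AAACCTGAGAAACGAG-1": 0.02,
    "AAACCTGAGAAACGCC-1": 0.87,
}

# Assumed cutoff for illustration; choose one appropriate for your analysis.
PROBABILITY_THRESHOLD = 0.5

# Keep barcodes whose probability of containing a cell exceeds the cutoff.
kept_barcodes = sorted(
    barcode
    for barcode, probability in cell_probability.items()
    if probability > PROBABILITY_THRESHOLD
)

# Alternative: read a barcode list such as cell_barcodes.csv.  The layout
# assumed here (one barcode per line, no header) is a guess.
example_csv = io.StringIO("AAACCTGAGAAACCAT-1\nAAACCTGAGAAACGCC-1\n")
csv_barcodes = [row[0] for row in csv.reader(example_csv) if row]
```

Either list of barcodes can then be used to subset the background-subtracted
count matrix before clustering or visualization.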

Methodology
    Inferences made by ``remove-background`` are based on a Bayesian model and
    amortized stochastic variational inference (SVI).  During the training
    phase, the model automatically learns the ambient RNA profile from the
    data.  It also uses an auto-encoder-like neural module to learn the
    manifold of true (i.e. background-corrected) gene expression. The latter
    helps us distinguish empty from cell-containing droplets. The model and
    the inference algorithm are implemented using the probabilistic programming
    language `Pyro <https://github.com/pyro-ppl/pyro/>`_.
    Additional details about the method, including extensive evaluations,
    can be found in the following pre-print:

    S.J. Fleming, M. Babadi, et al. Unsupervised removal of background RNA
    counts from scRNA-seq datasets. (2019) bioRxiv.

mbabadi commented 5 years ago

PR #20 introduces a basic structure for the CellBender documentation. I will keep this issue open to remind us to review and improve the documentation before the v0.1 release.