cNMF is a pipeline for inferring gene expression programs from scRNA-Seq. It takes a count matrix (N cells X G genes) as input and produces a (K x G) matrix of gene expression programs (GEPs) and a (N x K) matrix specifying the usage of each program for each cell in the data. Read more about the method in the publication and check out examples on simulated data and PBMCs.
We have also created a tutorial for running cNMF from R. See the Rmd notebook or the compiled html for this.
cNMF has been tested with Python 3.7 and 3.10 and requires scikit-learn>=1.0, scanpy>=1.8, and AnnData>=0.9
You can install with pip:
pip install cnmf
If you want to use the batch correction preprocessing, you also need to install the Python implementation of Harmony and scikit-misc
pip install harmonypy
pip install scikit-misc
cNMF can be run from the command line without any parallelization using the example commands below:
cnmf prepare --output-dir ./example_data --name example_cNMF -c ./example_data/counts_prefiltered.txt -k 5 6 7 8 9 10 11 12 13 --n-iter 100 --seed 14
cnmf factorize --output-dir ./example_data --name example_cNMF --worker-index 0 --total-workers 1
cnmf combine --output-dir ./example_data --name example_cNMF
cnmf k_selection_plot --output-dir ./example_data --name example_cNMF
cnmf consensus --output-dir ./example_data --name example_cNMF --components 10 --local-density-threshold 0.01 --show-clustering
Or alternatively, the same steps can be run from within a Python environment using the commands below:
from cnmf import cNMF
cnmf_obj = cNMF(output_dir="./example_data", name="example_cNMF")
cnmf_obj.prepare(counts_fn="./example_data/counts_prefiltered.txt", components=np.arange(5,14), n_iter=100, seed=14)
cnmf_obj.factorize(worker_i=0, total_workers=1)
cnmf_obj.combine()
cnmf_obj.k_selection_plot()
cnmf_obj.consensus(k=10, density_threshold=0.01)
usage, spectra_scores, spectra_tpm, top_genes = cnmf_obj.load_results(K=10, density_threshold=0.01)
For the Python environment approach, usage
will contain the usage matrix with each cell normalized to sum to 1. spectra_scores
contains the gene_spectra_scores output (aka Z-score unit gene expression matrix), spectra_tpm
contains the GEP spectra in units of TPM and top_genes
contains an ordered list of the top 100 associated genes for each program.
Output data files will all be available in the ./example_data/example_cNMF directory including:
example_data/example_cNMF/example_cNMF.gene_spectra_score.k_10.dt_0_01.txt
example_data/example_cNMF/example_cNMF.gene_spectra_tpm.k_10.dt_0_01.txt
example_data/example_cNMF/example_cNMF.usages.k_10.dt_0_01.consensus.txt
example_data/example_cNMF/example_cNMF.k_selection.png
example_data/example_cNMF/example_cNMF.clustering.k_10.dt_0_01.pdf
Some usage notes:
cnmf factorize --output-dir ./example_data --name example_cNMF --worker-index 0 --total-workers 3 &
cnmf factorize --output-dir ./example_data --name example_cNMF --worker-index 1 --total-workers 3 &
cnmf factorize --output-dir ./example_data --name example_cNMF --worker-index 2 --total-workers 3 &
would break the factorization jobs up into 3 batches and submit them independently. This can be used with compute clusters to run the factorizations in parallel (see tutorials for example).
See the tutorials or Stepwise_Guide.md for more details
We have implemented a pipeline to integrate batch variables prior to running cNMF and to handle ADTs in CITE-Seq. It uses an adaptation of Harmony that corrects the underlying count matrix rather than principal components. We describe it in our recent preprint. See the batch correction tutorial as well for an example.
We use a separate Preprocess class to run batch correction. You pass in an AnnData object, as well as harmony_vars, a list of the names of variables to correct correspond to columns in the AnnData obs attribute. You also specify an output file base name to save the results to like below:
from cnmf import cNMF, Preprocess
#Initialize the Preprocess object
p = Preprocess(random_seed=14)
#Batch correct the data and save the corrected high-variance gene data to adata_c, and the TPM normalized data to adata_tpm
(adata_c, adata_tpm, hvgs) = p.preprocess_for_cnmf(adata, harmony_vars=['Sex', 'Sample'], n_top_rna_genes = 2000, librarysize_targetsum= 1e6,
save_output_base='./example_islets/batchcorrect_example_sex')
#Then run cNMF passing in the corrected counts file, tpm_fn, and HVGs as inputs
cnmf_obj_corrected = cNMF(output_dir='./example_islets', name='BatchCorrected')
cnmf_obj_corrected.prepare(counts_fn='./example_islets/batchcorrect_example.Corrected.HVG.Varnorm.h5ad',
tpm_fn='./example_islets/batchcorrect_example.TP10K.h5ad',
genes_file='./example_islets/batchcorrect_example.Corrected.HVGs.txt',
components=[15], n_iter=20, seed=14, num_highvar_genes=2000)
#Then proceed with the rest of cNMF as normal
cnmf