This repo contains a collection of scripts and notebooks that I use for initial pre-processing and dimensionality reduction of 10X scATAC-seq data. These handle steps that come after peak calling with the cellatac pipeline.
Input: after running cellatac
you will have a results folder. From that we will need the contents of results/peak_matrix
Set up: Make a new folder to store all your results
mkdir my_scATAC_dir
outdir=./my_scATAC_dir
cellatac_outs=results/peak_matrix
Install required R packages (other packages are installed directly in the notebooks)
install.packages(c("data.table","Signac", "tidyverse", "argparse"))
if (!requireNamespace("BiocManager", quietly = TRUE)){
install.packages("BiocManager")
}
BiocManager::install(c("ensembldb", "EnsDb.Mmusculus.v79", "EnsDb.Hsapiens.v86","GenomicRanges"))
Install required python packages:
pip install scanpy anndata numpy pandas anndata2ri rpy2 scipy
1. Annotating peaks annotate_peaks.R
: This script uses functionality in GenomicRanges
to compute some basic stats on the peaks identified by cellatac
. We will use these for filtering and in later stages of the analysis.
In terminal:
Rscript annotate_peaks.R $cellatac_outs/peaks.txt $outdir --genome hg38
2. Make anndata and filter peaks N1_cellatac2anndata.ipynb
:
Here we make an anndata object, we filter out a lot of peaks and start saving different layers (loading the big matrix takes some time, be patient). The notebook saves an .h5ad
abject storing the raw counts and peak annotations for the dataset.
3. Run cisTopic N1_add_cistopic.ipynb
: cisTopic uses Latent Dirichlet Allocation (LDA) to perform dimensionality reduction on a binary (or sort of binary) matrix of peak x cell accessibility. LDA is a robust Bayesian method used in text mining to group documents addressing similar topics and related words into topics. cisTopic treats cells as documents and peaks as words to group cells where the same "regulatory topics" are active. Read all about it here.
Here we use the LDA model from cisTopic for 3 purposes
References for uses of cisTopic:
4. Match genes to proximal peaks N3_peak2genes.ipynb
A range of downstream analyses require to associate genes to peaks in their proximity (i.e. overlapping the gene body or less than n kbs away). Here we use functionality in the GenomicRanges
R package to create a sparse matrix of peak x gene assignment, where 1s indicate that a peak is in the proximity of the gene (N.B. A peak might be proximal to multiple genes). The default proximity window is 50 kbs.