A Snakemake 8 workflow implementation of the BSF's ATAC-seq Data Processing Pipeline extended by downstream quantification and annotation steps using Bash and Python.
[!NOTE]
This workflow adheres to the module specifications of MR.PARETO, an effort to augment research by modularizing (biomedical) data science. For more details, instructions, and modules check out the project's repository.⭐️ Star and share modules you find valuable 📤 - help others discover them, and guide our future work!
[!IMPORTANT]
If you use this workflow in a publication, please don't forget to give credit to the authors by citing it using this DOI 10.5281/zenodo.6323634.
This project wouldn't be possible without the following software and their dependencies:
Software | Reference (DOI) |
---|---|
bedtools | https://doi.org/10.1093/bioinformatics/btq033 |
Bowtie2 | https://doi.org/10.1038/nmeth.1923 |
deeptools | https://doi.org/10.1093/nar/gkw257 |
ENCODE | https://doi.org/10.1038/s41598-019-45839-z |
fastp | https://doi.org/10.1093/bioinformatics/bty560 |
HOMER | https://doi.org/10.1016/j.molcel.2010.05.004 |
MACS2 | https://doi.org/10.1186/gb-2008-9-9-r137 |
MultiQC | https://doi.org/10.1093/bioinformatics/btw354 |
pybedtools | https://doi.org/10.1093/bioinformatics/btr539 |
pandas | https://doi.org/10.5281/zenodo.3509134 |
samblaster | https://doi.org/10.1093/bioinformatics/btu314 |
samtools | https://doi.org/10.1093/bioinformatics/btp352 |
Snakemake | https://doi.org/10.12688/f1000research.29032.2 |
UROPA | https://doi.org/10.1038/s41598-017-02464-y |
This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) have to be read out from the respective conda environment specifications (workflow/envs/*.yaml file
) or post-execution in the result directory (atacseq_pipeline/envs/*.yaml
). Parameters that have to be adapted depending on the data or workflow configurations are denoted in squared brackets e.g., [X].
Processing. Sequencing adapters were removed using the software fastp (ver) [ref]. Bowtie2 (ver) [ref] was used for the alignment of the short reads (representing locations of transposition events) to the [GRCh38 (hg38)/GRCm38 (mm10)] assembly of the [human/mouse] genome using the “--very-sensitive” parameter. PCR duplicates were marked using samblaster (ver) [ref]. Aligned BAM files were then sorted, filtered using ENCODE blacklisted regions [ref], samtools view flags [SAM_flag], and indexed using samtools (ver) [ref]. To detect the open chromatin regions, peak calling was performed using MACS2 (ver) [ref] using the “--nomodel”, “--keep-dup [macs2_keep_dup]” and “--extsize 147” options on each sample. HOMER (ver) [ref] function findMotifs was used for motif enrichment analysis of the detected open chromatin regions. Quality control metrics were aggregated and reported using MultiQC (ver) [ref], [X] sample(s) needed to be removed.
Quantification. A consensus region set, comprising of [X] genomic regions, was generated, by merging the identified peak summits, extended by [slop_extension]bp on both sides using the slop function from bedtools (ver) [ref] and pybedtools (ver) [ref], across all samples while again discarding peaks overlapping blacklisted features as defined by the ENCODE project [ref]. The consensus region set was used to quantify the chromatin accessibility in each sample by summing the number of reads overlapping each consensus region. The consensus region set, and sample-wise quantification of accessibility was performed using bedtools (ver) [ref] and pybedtools (ver) [ref]. Furthermore, the consensus region set was used to quantify the peak support per sample and each region was mapped to its closest TSS according to the HOMER annotation within proximal TSS up and downstream distances [proximal_size_up/down] yielding a gene/TSS-based quantification. Complementary, all promoter regions, defined by the same parameters, were quantified for each sample and aggregated to yield a gene/promoter-based quantification. Finally, all sample-wise enriched known motifs according to HOMER were aggregated.
Annotation. Consensus regions were annotated using annotatePeaks function from HOMER (ver) [ref]. Additionally, we annotated all consensus regions using UROPA (ver) [ref] and genomic features from the [GENCODE vX] basic gene annotation as: “TSS proximal” if the region’s midpoint was within [X] bp upstream or [X] bp downstream from a TSS, or if the region overlapped with a TSS; “gene body” if the region overlapped with a gene; “distal” if the region’s midpoint was within [X] bp of a TSS; and “intergenic” otherwise. For each region, only the closest feature was considered, and the annotations took precedence in the following order: TSS proximal, gene body, distal, and intergenic. Finally, bedtools was employed to quantify nucleotide counts and proportional content per consensus region.
The processing and quantification described here was performed using a publicly available Snakemake [ver] (ref) workflow [10.5281/zenodo.6323634].
results/
)
samtools view
can be configured using SAM Flags (SAM_flag
).MACS2
.
macs2_keep_dup
.report/
)
counts/
)
consensus_regions.bed
).consensus_counts.csv
).support_counts.csv
).TSS_regions.bed
, TSS_counts.csv
, TSS_annotation.csv
).promoter_regions.bed
, promoter_counts.csv
, promoter_annotation.csv
).
Y
are skipped.HOMER_knownMotifs.csv
).counts/
)
sample_annotation.csv
).consensus_annotation.csv
)
workflow/resources/UROPA/*.txt
.annotatePeaks.pl
[!IMPORTANT]
Duplciate reads can be filtered during the alignment step bysamtools
and/or ignored during peak calling byMACS2
. The inclusion of duplicates should be intentional, and may lead to a large number of consensus regions. The removal of duplicates should be intentional, might remove real biological signal. The decision depends on your downstream analysis steps e.g., rigorous filtering (e.g., usingedgeR::filterByExpr
) and/or accounting for PCR bias by normalization conditional on genomic region length and GC content (e.g., CQN) and goals (e.g., differential accessibility analysis). We recommend reading this ChIP-seq tutorial's section on "Removing redundancy".
These steps are the recommended usage for this workflow:
This workflow is written with Snakemake and its usage is described in the Snakemake Workflow Catalog.
Detailed specifications can be found here ./config/README.md
We provide data, annotation and configuration files for two example datasets (hg38 & mm10) in ./test. In both cases the data was generated for test purposes only by downsampling real ATAC-seq samples using samtools.
samtools view -s .0001 real_sample.bam -b > test_sample.bam
The pass_qc attribute is set 0 for all samples, because the downsampled data does not contain any peaks for downstream quantification.
Below are some guidelines for the manual quality control of each sample, but keep in mind that every experiment/dataset is different.
X
/Y
) for accessibility as positive controls.[!IMPORTANT]
Sometimes reads map toY
in females, becauseX
andY
chromosomes both have pseudoautosomal regions (PARs) that are common between the two chromosomes.
My personal QC value scheme to inform downstream analyses (e.g., unsupervised analysis)
Finally, a previous PhD student in our lab, André Rendeiro, wrote about "ATAC-seq sample quality, wet lab troubleshooting and advice".
Data Resources: To ensure the reproducibility of results and to make the workflow accessible we provide all required reference data for the analysis of ATAC-seq samples for human GRCh38 (hg38) and mouse GRCm38 (mm10) genomes on Zendodo.
# download Zenodo records using zenodo_get
# install zenodo_get v1.3.4
conda install -c conda-forge zenodo_get=1.3.4
# human GRCh38 (hg38)
zenodo_get --record 6344173 --output-dir=resources/atacseq_pipeline/hg38/
cd resources/atacseq_pipeline/hg38
unzip indices_for_Bowtie2.zip && rm indices_for_Bowtie2.zip
# mouse GRCm38 (mm10)
zenodo_get --record 6344322 --output-dir=resources/atacseq_pipeline/mm10/
cd resources/atacseq_pipeline/mm10
unzip indices_for_Bowtie2.zip && rm indices_for_Bowtie2.zip
The following publications successfully used this module for their analyses.