scDamAndTools is a set of functions for the processing of raw data generated by scDam&T-seq, a method to simultaneously measure protein-DNA interactions and transcription from single cells (Rooijers et al., 2019). scDam&T-seq combines a DamID-based method to measure protein-DNA interactions and an adaptation of CEL-Seq to measure transcription. The starting point of the workflow is raw sequencing data and the end result are tables of UMI-unique DamID and CEL-Seq counts.
scDamAndTools was developed in Python 3.6.5 and is compatible with Python 3.6+.
We recommend users to install scDamAndTools and its requirements in a Python virtual environment. In order to generate a virtual environment:
python3 -m venv $HOME/.venvs/my_venv
To activate the virtual environment:
source ~/.venvs/my_venv/bin/activate
Install necessary packages:
pip install --upgrade pip wheel setuptools
pip install cython
pip install git+https://github.com/KindLab/scDamAndTools.git
scDamAndTools contains eleven functions for the processing of raw sequencing data generated by scDam&T-seq experiments:
function | description |
---|---|
add_read_prefix.awk |
Appends additional bases (e.g. "GA") to the beginning of reads. |
bin_damid_counts.py |
Bins UMI-unique DamID counts into genomically equal-sized bins. |
create_motif_arrays |
Wrapper function that combines all necessary steps to generate motif position and mappability arrays. |
demultiplex.py |
Demultiplexes all reads in a library based on their barcode. |
fetch_regions.py |
Generates in silico reads of a defined read length for all motif occurrences. |
find_motif_occurrences.py |
Finds motif (e.g. "GATC") occurrences genome-wide from a FASTA file. |
generate_celseq_counts.py |
Processes aligned CEL-Seq reads into a table of UMI-unique counts. |
generate_damid_counts.py |
Processes aligned DamID reads into a table of UMI-unique counts. |
process_celseq_reads |
Wrapper function that first alignes the CELseq reads and subsequently calls generate_celseq_counts.py . |
process_damid_reads |
Wrapper function that first alignes the DamID reads and subsequently calls generate_damid_counts.py . |
write_posarray.py |
Generates a table with the positions of a motif genome-wide. |
add_read_prefix.awk -v PREFIX="GA" infile
Takes a FASTQ file as an input and appends the specified prefix to the beginning of all reads. The output in FASTQ format and is directed to STDOUT.
arguments | option | description |
---|---|---|
infile |
required | FASTQ file with sequence reads. Alternatively, reads can be piped via STDIN. |
-v PREFIX= |
required | Sequence to be appended to the beginning of all reads. |
bin_damid_counts.py [-h] [--verbose] [--quiet] --outfile OUTFILE [--binsize INT] --posfile HDF5_FILE [--mapfile HDF5_FILE] infiles [infiles ...]
Takes one or more position-wise UMI-unique DamID count tables (HDF5) as an input and outputs a binned count table with binsizes of the specified size. All input files are summed in the final output file.
arguments | option | description |
---|---|---|
infiles |
required | HDF5 file(s) of position-wise UMI-unique DamID count table(s). |
--outfile |
required | HDF5 file name to write output to. |
--binsize |
optional | Size of the genomic bins into which data will be summarized [default: 2000]. |
--posfile |
required | Location of the motif position array (HDF5). |
--mapfile |
required | Location of the motif mappability array (HDF5). |
-v/--verbose |
optional | Set level of logging messages. |
-q/--quiet |
optional | Silence all logging messages. |
create_motif_arrays [-h] [-m MOTIF] -r READLENGTH -x HISAT2_INDEX -o OUTPREFIX fastafile
A wrapper function that combines all necessary steps to generate a motif position and mappability array. It takes the FASTA file of the relevant genome as input and generates a BED file and HDF5 file indicating the genome-wide positions of the motif (e.g. "GATC") and generates a HDF5 file that indicates whether each motif occurrence can be aligned based on the expected readlength.
arguments | option | description |
---|---|---|
fastafile |
required | FASTA file of the relevant genomic reference. |
-m |
optional | The sequence motif targeted by the used restriction enzyme [default: GATC]. |
-o |
required | The prefix (including path) for the position and mappability files. |
-r |
required | The expected readlength of the sequencing reads (specifically R1), excluding barcode and UMI. |
-x |
required | HISAT2 reference index. |
demultiplex.py [-h] [--verbose] [--quiet] -o OUTFMT [-m MISMATCHES] [-k] [--verify-no-ambiguous] [--unmatched-outfile FILE] [--ambiguous-outfile FILE] [--infofile FILE] bcfile infiles [infiles ...]
Takes one or more raw sequencing files (FASTQ) as an input and outputs a file (FASTQ) for each DamID/CEL-Seq barcode provided in the barcode file containing all reads matching that barcode.
arguments | option | description |
---|---|---|
bcfile |
required | TAB-separated file with DamID and CEL-Seq barcode names and barcode sequences corresponding to the samples in the library. |
infiles |
required | FASTQ files of raw sequencing reads. |
-o/--outfmt |
required | File name format for output files with demultiplexed reads per sample. Should contain a "{name}" field for the barcode name (and a {readname} field for PE input). |
-m/--mismatches |
optional | The maximum number of mismatches allowed between the provided barcodes and observed sequences [default: 1]. |
-k/--keep |
optional | If set, the barcode sequences are not trimmed from the reads [default: False]. |
--verify-no-ambiguous |
optional | If set, verifies whether two or more provided barcode sequences are inherently ambiguous given the number of allowed mismatches [default: False]. |
--unmatched-outfile |
optional | Output file (BAM) for reads in the library that could not be matched to any barcode. In the case of PE input, should contain a "{readname}" field. |
--ambiguous-outfile |
optional | Output file (BAM) for reads in the library that match multiple barcodes. In the case of PE input, should contain a "{readname}" field. |
--infofile |
optional | File in which to report demultiplexing stats. |
-v/--verbose |
optional | Set level of logging messages. |
-q/--quiet |
optional | Silence all logging messages. |
fetch_regions.py [-h] -l LENGTH [--verbose] [--quiet] bedfile fastafile
Takes a file (BED) of genome-wide motif occurrences and a reference genome sequence (FASTA) as input and generates in silico reads based on motif occurrences and desired read length.
arguments | option | description |
---|---|---|
bedfile |
required | BED file containing the motif occurrences genome-wide. |
fastafile |
required | FASTA file containing the reference genome sequence. |
-l/--length |
required | Length (in bp) of the in silico reads to be generated. |
-v/--verbose |
optional | Set level of logging messages. |
-q/--quiet |
optional | Silence all logging messages. |
find_motif_occurrences.py [-h] filename motif
Takes a (gzipped) genome reference file (FASTA) as input and finds all positions at which the provided motif occurs. Outputs the found positions in BED format to STDOUT. Note, does not handle revcomp well. But that doesn't matter if your motif is itself palindromic.
arguments | option | description |
---|---|---|
filename |
required | Genome reference file name (FASTA). |
motif |
required | Motif to be found (e.g. "GATC"). |
generate_celseq_counts.py [-h] [--verbose] [--quiet] --gtf GTF [--mode {intersection-strict,intersection-nonempty,union}] [--min-mapq MIN_MAPQ] [--outfile HDF5_FILE] bamfile [bamfile ...]
Takes an alignment file (BAM) as input and generates a table of UMI-unique transcript counts in HDF5 format, containing the number of observed UMI-unique transcripts per gene, sorted by their ENSEMBL ID as provided in the GTF file. Whether an alignment is considered to overlap a feature is determined by the choice of overlap mode. The available modes are modeled after the HTseq modes.
arguments | option | description |
---|---|---|
bamfile |
required | Input BAM file(s) of aligned CEL-Seq reads. |
-g/--gtf |
required | Reference GTF file. |
-m/--mode |
required | Overlap mode for aligning reads to features. Based on HTseq. Options: {intersection-strict, intersection-nonempty, union}. |
--min-mapq |
optional | Minimum mapping quality of aligned reads. Reads below the threshold are discarded [default: 0]. |
-o/--outfile |
optional | Output file (HDF5) to store UMI-unique transcript counts. If not provided, output is send to STDOUT as a table of geneid and associated counts. |
--save-stats |
optional | If set, a text file containing the CEL-Seq counting statistics will be generated [default: False]. |
-v/--verbose |
optional | Set level of logging messages. |
-q/--quiet |
optional | Silence all logging messages. |
generate_damid_counts.py [-h] [--verbose] [--quiet] --outfile OUTFILE [--invalid-outfile INVALID_OUTFILE] [--min-mapq MIN_MAPQ] [--umi-length UMI_LENGTH] [--keep-n KEEP_N] [--min-editdistance MIN_EDITDISTANCE] --pos-file HDF5_FILE infile
Takes an alignment file (BAM) as input and generates a table of UMI-unique GATC counts (HDF5).
arguments | option | description |
---|---|---|
infile |
required | BAM input file of aligned DamID reads sorted by genomic position. Use "-" when reads are piped from STDIN. |
--pos-file |
required | Motif position file (HDF5). |
--outfile |
required | Output file (HDF5) to which UMI-unique counts are written. |
--invalid-outfile |
optional | Output file (BAM) to store invalid reads (i.e. too low mapping quality or not aligning to a GATC). |
--min-mapq |
optional | Minimum mapping quality of aligned reads. Reads below the threshold are discarded [default: 0]. |
--umi-present |
optional | If set, available UMI information will be used to remove PCR duplicates [default: False]. |
--keep-n |
optional | Maximum number of unique UMIs to allow per position [default: keep all]. |
--min-editdistance |
optional | Minimum edit distance between pairs of UMIs that are observed per position [default: 1]. |
--save-stats |
optional | If set, a text file containing the DamID counting statistics will be generated [default: False]. |
-v/--verbose |
optional | Set level of logging messages. |
-q/--quiet |
optional | Silence all logging messages. |
process_celseq_reads [-h] -o OUTPREFIX -g GTF -x HISAT2_INDEX fastqfile
Wrapper script that aligns CEL-Seq reads (using HISAT2) and subsequently calls generate_celseq_counts.py
. The function takes CEL-Seq reads (FASTQ) as input and outputs a alignment file (BAM) and count file (HDF5), containing the number of observed UMI-unique transcripts per gene, sorted by their ENSEMBL ID as provided in the GTF file.
arguments | option | description |
---|---|---|
fastqfile |
required | FASTQ file containing CEL-Seq reads. |
-o |
required | The prefix (including path) for the generated alignment and count file. |
-g |
required | GTF file for the relevant genome reference. |
-x |
required | HISAT2 index for the relevant genome reference. |
-q |
optional | Minimum mapping quality to retain read [default: 10]. |
process_damid_reads [-h] [-m PREFIX] [-q MIN_MAPQ] [-u] [-k KEEP_N] [-d MIN_EDITDISTANCE] -o OUTPREFIX -p POSARRAY -x HISAT2_INDEX fastqfile
Wrapper script that aligns DamID reads (using HISAT2) and subsequently calls generate_damid_counts.py
. The function takes DamID reads (FASTQ) as input and outputs a alignment file (BAM) and count file (HDF5), containing the number of observed (UMI-unique) counts per motif occurrence.
arguments | option | description |
---|---|---|
fastqfile |
required | FASTQ file containing DamID reads. |
-o |
required | The prefix (including path) for the generated alignment and count file. |
-p |
required | Motif position file for the relevant genome reference. |
-x |
required | HISAT2 index for the relevant genome reference. |
-m |
optional | Motif prefix to append to reads [default: GA]. |
-q |
optional | Minimum mapping quality of aligned reads. Reads below the threshold are discarded [default: 0]. |
-u |
optional | If set, available UMI information will be used to remove PCR duplicates [default: False]. |
-k |
optional | Maximum number of unique UMIs to allow per position [default: 4]. |
-d |
optional | Minimum edit distance between pairs of UMIs that are observed per position [default: 2]. |
write_posarray.py [-h] [-l ELEMENT_LENGTH] --outfile OUTFILE pos_file
Takes a BED file describing all motif occurrences and generates a table in HDF5 format with all motif positions genome-wide. For each chromosome, the first entry is the position of the first motif occurrence on the plus strand, subsequent values represent the distance to the next motif. Only positions on the plus strand are indicated, motif positions on the minus strand can be inferred.
arguments | option | description |
---|---|---|
pos_file |
required | BED input file with motif positions. |
--outfile |
required | Output file name (HDF5) for posarray. |
-l/--element-length |
optional | Length of the motif in bp (e.g. 4 for the GATC motif) [default: infer length from input]. |
The analysis steps in this and subsequent sections demonstrate how scDam&T-seq data can be analysed using the provided software package (scDamAndTools). File names and genome references are chosen to match the test data available as part of the package. Care should be taken to modify these when applying the analysis on other data. In some parts of the workflow, we offer the option to use a wrapper script to combine multiple processing steps. Although these wrapper scripts ease the workflow, they are less customizable. When applying scDamAndTools in a custom workflow, the individual steps offer more flexibility.
Many of the functions provided in the scDamAndTools package are Python3-based and can be easily incorporated in a custom workflow. An exception this are the wrapper functions "create_motif_refarrays", "process_celseq_reads" and "process_damid_reads", which are bash scripts. The workflow as explained below was developed using the following environment and tools:
We have included a test dataset set as part of this repository. In the following steps, we will demonstrate how the scDam&Tools package can be implemented to analyse this data. The data comes from a mouse Embryonic Stem Cell (mESC) line expressing Dam-LaminB1. As such, we expect that the resulting DamID profiles show Lamina-Associated Domains (LADs). The raw sequencing data contains downsampled DamID and CEL-Seq reads of five single cells.
Step 1: For the purpose of this tutorial, the FASTA file, GTF file and HISAT2 index will all be placed in a folder named “references”. For that reason, generate this folder in your own working directory, or replace all mentions of the “references” directory with the path of your own choice:
mkdir ./references
Step 2: Download the reference sequence of the relevant species, for example from the ensembl webpage. Click the “DNA (FASTA)” link and download the file ending with ‘dna.primary_assembly.fa.gz’. Unzip downloaded file and place it in the “references” directory.
Step 3: Download the GTF file for the relevant species, for example from the ensembl webpage. Unzip downloaded file and place it in the “references” directory.
Step 4: If using ERCC spike-ins, their sequences should be added to the genome FASTA file and the GTF file.
Step 5: Generate a HISAT2 index for future alignment of both DamID and CEL-Seq2 derived reads. In order for the HISAT2 index to optimally align spliced transcripts, it is necessary to provide information on exon and splice site locations. These can be extracted from the GTF file with scripts provided by HISAT2. Incorporating splice site and exon information requires a lot of working memory, approximately 200 Gb RAM for the human genome. If this memory is not available, HISAT2 indices for the most commonly used genomes can also be directly downloaded from the HISAT2 website (http://daehwankimlab.github.io/hisat2/download/). To build the index yourself:
GTF="./references/Mus_musculus.GRCm38.98.with_ERCC.gtf"
FASTAFN="./references/Mus_musculus.GRCm38.dna.primary_assembly.with_ERCC.fa"
HISAT2_INDEX="./references/Mus_musculus.GRCm38.dna.primary_assembly.with_ERCC"
SSFN="./references/Mus_musculus.GRCm38.dna.primary_assembly.with_ERCC.splice_sites.txt"
EXONFN="./references/Mus_musculus.GRCm38.dna.primary_assembly.with_ERCC.exons.txt"
hisat2_extract_splice_sites.py $GTFFN > $SSFN hisat2_extract_exons.py $GTFFN > $EXONFN hisat2-build --ss $SSFN --exon $EXONFN $FASTAFN $HISAT2_INDEX
### Generate GATC reference arrays
To efficiently match obtained DamID reads to specific instances of the GATC motif in the genome, we generate two reference arrays. The first array (“position array” or “posarray”) contains the positions of all GATC positions in the genome. The second array (“mappability array” or “maparray”) indicates whether it is possible to uniquely align a read derived from a particular (strand-specific) GATC instance. The mappability array is used to filter out ambiguously aligning GATCs and can serve as an indicator of the (mappable) GATC density along the chromosomes. During the generation of the mappability array, in silico reads are generated for each GATC instance and are subsequently mapped back to the reference genome. The length of the reads should be chosen to be the same as the length of the reads obtained in the experiment (excluding the adapter).
#### Option 1: Using the wrapper script
- *Step 6a.1:* Generate the motif position and mappability arrays:
IN_SILICO_READLENGTH=62 ARRAY_PREFIX="./refarrays/Mus_musculus.GRCm38.dna.primary_assembly" create_motif_refarrays \ -m "GATC" \ -o $ARRAY_PREFIX \ -r $IN_SILICO_READLENGTH \ -x $HISAT2_INDEX \ $FASTAFN
#### Option 2: Performing the individual steps
- *Step 6b.1:* Find “GATC” occurrences in genome:
mkdir ./refarrays/ find_motif_occurrences.py \ $FASTAFN \ "GATC" \ | awk '$1 !~ /^ERCC/' \ | gzip > ./refarrays/Mus_musculus.GRCm38.dna.primary_assembly.GATC.positions.bed.gz
GATC positions in the ERCC sequences are discarded, since these are not present in the genomic DNA of the cells.
- *Step 6b.2:* Generate an HDF5 file with all GATC positions in the genome:
write_posarray.py \ --outfile ./refarrays/Mus_musculus.GRCm38.dna.primary_assembly.GATC.posarray.hdf5 \ ./refarrays/Mus_musculus.GRCm38.dna.primary_assembly.GATC.positions.bed.gz
- *Step 6b.3:* Generate _in silico_ reads mapping to GATC positions and align these back to the reference to determine the mappability of the positions:
IN_SILICO_READLENGTH=62; fetch_regions.py \ -q \ -l $IN_SILICO_READLENGTH \ ./refarrays/Mus_musculus.GRCm38.dna.primary_assembly.GATC.positions.bed.gz \ $FASTAFN \ | hisat2 \ --seed 42 \ -f \ -x $HISAT2_INDEX \ -U - \ --no-spliced-alignment \ --mp '2,0' \ --sp '4,0' \ | samtools view -ub - \ | samtools sort -m 500M -l 9 \
./refarrays/Mus_musculus.GRCm38.dna.primary_assembly.GATC.readlength_62.sorted.bam
generate_damid_counts.py \
-vvv \
--outfile ./refarrays/Mus_musculus.GRCm38.dna.primary_assembly.GATC.readlength_62.maparray.hdf5 \
--pos-file ./refarrays/Mus_musculus.GRCm38.dna.primary_assembly.GATC.posarray.hdf5 \
./refarrays/Mus_musculus.GRCm38.dna.primary_assembly.GATC.readlength_62.sorted.bam
At the end of these steps, four files should have been generated, ending in ending in “.positions.bed.gz” (all occurrences of the GATC motif in BED format), “.posarray.hdf5” (all occurrences of the GATC motif as a HDF5 array), ".sorted.bam" (alignment of the in silico generated reads in BAM format) and “.maparray.hdf5” (the mappability of all GATC motifs as a HDF5 array). In case the wrapper script was used, the alignment file (BAM) has been automatically deleted.
In this step, the raw sequencing reads are split into multiple files based on their barcodes. For each barcode, a separate file will be made. Reads that do not match any barcode or that match multiple barcodes can be stored in separated files.
Step 7: If not already done so by the sequencing facility, demultiplex the raw data based on the used Illumina indices.
Step 8: Create a file that describes the barcodes that were used in the library. The file should have two columns, listing the adapter names and sequences respectively. The location and length of UMIs should be indicated with numbers and dashes. The DamID and CEL-Seq barcodes in this experiment follow a design with 3bp UMI - 4bp barcode - 3bp UMI - 4bp barcode, with an additional "GA" at the end of the DamID barcodes. Using such a design, the barcode file of a library with two samples should look as follows:
DamID_BC_001 3-TGCT-3-GAGAGA
DamID_BC_002 3-ATTG-3-GAACGA
CELseq_BC_001 3-ACAG-3-AGGC
CELseq_BC_002 3-GTCT-3-GCCA
Example data and relevant barcode file are included in the scDamAndTools package in the tutorial folder.
NOTE: There may be multiple raw sequencing files pertaining to the same samples, for example from the different sequencing lanes. These files should be concatenated, latest prior to the sorting of aligned reads.
OUTFMT="./data/demultiplexed/index01.{name}.{readname}.fastq.gz"
INFOFN="./data/demultiplexed/index01.demultiplex_info.txt"
demultiplex.py \
-vvv \
--mismatches 0 \
--outfmt $OUTFMT \
--infofile $INFOFN \
./metadata/index01.barcodes.tsv \
./data/raw/index01_R1_001.fastq.gz \
./data/raw/index01_R2_001.fastq.gz
The demultiplex script generates a separate FASTQ file for each barcode provided in the barcode information file (see step 8) that contains all reads matching this barcode. In addition, a text file (“index01.demultiplex_info.txt”) is generated that details the number of reads matched to each barcode:
ambiguous 0
matching_mismatch 0
matching_perfect 450000
unmatching 0
CELseq_BC_001 724 0
CELseq_BC_002 11087 0
CELseq_BC_003 12096 0
CELseq_BC_004 14603 0
CELseq_BC_005 9696 0
DamID_BC_001 102140 0
DamID_BC_002 31597 0
DamID_BC_003 111963 0
DamID_BC_004 46903 0
DamID_BC_005 109191 0
In this case, there are no reads ambiguous or unmatched barcodes, since the test dataset represents a downsampled version of several samples that have previously been selected on their barcode. In a typical experiment, 5-10% of reads cannot be unambiguously matched to any barcode.
The subsequent steps need to be performed on all DamID demultiplexed files. It is highly recommended that this process be parallelized on a high-performance computing cluster. The amount of time necessary for these steps depends entirely on the number of libraries, samples per library and available computing cores.
Step 10a.1: Process the DamID reads to arrays of (UMI-unique) GATC counts. The script aligns the DamID reads to the genome and subsequently matches them to positions as indicated in the position array (see step 6). Since the GATC motif is cleaved in half by DpnI, the prefix “GA” is added to all reads prior to alignment. PCR-duplicates are filtered out based on the available UMI information. For this step, only the R1 reads are used since these contain the fragment aligning to the GATC motif:
OUTPREFIX="./data/damid/index01.DamID_BC_001";
POSARRAY="./refarrays/Mus_musculus.GRCm38.dna.primary_assembly.GATC.posarray.hdf5";
process_damid_reads \
-o $OUTPREFIX \
-m "GA" \
-p $POSARRAY \
-x $HISAT2_INDEX \
-u \
./data/demultiplexed/index01.DamID_BC_001.R1.fastq.gz
The script generates an alignment file ending in “.sorted.bam”, a GATC count file ending in “.counts.hdf5” and an information file ending in “.counts.stats.tsv”.
Step 10a.2: Bin the GATC count files into genomically equal-sized bins. The resulting HDF5 file contains for each chromosome the number of observed UMI-unique counts for each bin:
MAPARRAY="./refarrays/Mus_musculus.GRCm38.dna.primary_assembly.GATC.readlength_62.maparray.hdf5"
OUTFN="./data/damid/index01.DamID_BC_001.counts.binsize_100000.hdf5"
bin_damid_counts.py \
-vvv \
--mapfile $MAPARRAY \
--posfile $POSARRAY \
--binsize 100000 \
--outfile $OUTFN \
./data/damid/index01.DamID_BC_001.counts.hdf5
Step 10b.1: The restriction enzyme DpnI cuts the GATC motif in half and consequently the obtained reads start with “TC”. Add the “GA” prefix to all reads and align to the reference genome:
mkdir ./data/damid/
POSARRAY="./refarrays/Mus_musculus.GRCm38.dna.primary_assembly.GATC.posarray.hdf5";
gzip -dc ./data/demultiplexed/index01.DamID_BC_001.R1.fastq.gz \
| add_read_prefix.awk -v PREFIX="GA" \
| hisat2 \
--seed 42 \
-x $HISAT2_INDEX \
-U - \
--no-spliced-alignment \
--mp '2,0' \
--sp '4,0' \
| samtools view -ub -
| samtools sort
-m 500M
-T ./data/damid/index01.DamID_BC_001.sort_tmp
-l 9
-
> ./data/damid/index01.DamID_BC_001.sorted.bam
Step 10b.2: Generate UMI-unique GATC count files from the alignments. The resulting HDF5 file contains for each chromosome the number of observed UMI-unique counts for each strand-specific GATC:
generate_damid_counts.py \
-vvv \
--outfile "./data/damid/index01.DamID_BC_001.counts.hdf5" \
--min-mapq 10 \
--umi-present \
--keep-n 4 \
--min-editdistance 2 \
--pos-file $POSARRAY \
--save-stats \
./data/damid/index01.DamID_BC_001.sorted.bam
Step 10b.3: Bin the GATC count files into genomically equal-sized bins. The resulting HDF5 file contains for each chromosome the number of observed UMI-unique counts for each bin:
MAPARRAY="./refarrays/Mus_musculus.GRCm38.dna.primary_assembly.GATC.readlength_65.maparray.hdf5"
OUTFN="./data/damid/index01.DamID_BC_001.counts.binsize_100000.hdf5"
bin_damid_counts.py \
-vvv \
--mapfile $MAPARRAY \
--posfile $POSARRAY \
--binsize 100000 \
--outfile $OUTFN \
./data/counts/index01.DamID_BC_001.counts.hdf5
After running the above steps, the following files have been generated:
index01.DamID_BC_001.counts.binsize_100000.hdf5
index01.DamID_BC_001.counts.hdf5
index01.DamID_BC_001.counts.stats.tsv
index01.DamID_BC_001.sorted.bam
The BAM file contains the aligned DamID reads, the HDF5 file ending in ".counts.hdf5" contains the number of unique fragments observed per strand-specific GATC positions, the HDF5 file ending in ".binsize_100000.hdf5" contains the number of observed unique fragments per 100kb bin. Finally, the text file ending in ".stats.tsv" contains the processing statistics:
index01.DamID_BC_001.counts.hdf5 total_reads 102140
index01.DamID_BC_001.counts.hdf5 unmapped_reads 20280
index01.DamID_BC_001.counts.hdf5 lowmapq_reads 6705
index01.DamID_BC_001.counts.hdf5 nongatc_reads 3876
index01.DamID_BC_001.counts.hdf5 invalidumi_reads 3
index01.DamID_BC_001.counts.hdf5 valid_reads 71276
index01.DamID_BC_001.counts.hdf5 unique_counts 53612
Plotting the binned counts observed over chromosome 19 yields the following enrichment profiles:
These enrichment profiles are in line with the output we expect for a Dam-LaminB1 cell line, albeit somewhat sparse.
The subsequent steps need to be performed on all CEL-Seq demultiplexed files. It is highly recommended that this process be parallelized on a high-performance computing cluster. The amount of time necessary for these steps depends entirely on the number of libraries, samples per library and available computing cores.
OUTPREFIX="./data/celseq/index01.CELseq_BC_001"
GTF="./references/Mus_musculus.GRCm38.98.with_ERCC.gtf"
process_celseq_reads \
-o $OUTPREFIX \
-g $GTF \
-x $HISAT2_INDEX \
./data/demultiplexed/index01.CELseq_BC_001.R2.fastq.gz
Step 11b.1: Align all CEL-Seq reads to the reference genome:
hisat2 \
--seed 42 \
-x $HISAT2_INDEX \
-U ./data/demultiplexed/index01.CELseq_BC_001.R2.fastq.gz \
--mp '2,0' \
--sp '4,0' \
> ./data/celseq/index01.CELseq_BC_001.bam
Step 11b.2: From the aligned CEL-Seq reads, generate an HDF5 file that contains for each gene the number of UMI-unique transcripts that were observed:
GTF="./references/Mus_musculus.GRCm38.98.with_ERCC.gtf"
generate_celseq_counts.py \
-vvv \
--gtf $GTFFN \
--mode intersection-strict \
--min-mapq 10 \
--outfile ./data/celseq/index01.CELseq_BC_001.counts.hdf5 \
--save-stats \
./data/celseq/index01.CELseq_BC_001.bam
After running the above steps, the following files have been generated:
index01.CELseq_BC_001.bam
index01.CELseq_BC_001.counts.hdf5
index01.CELseq_BC_001.counts.stats.tsv
The BAM file contains the aligned CEL-Seq reads, while the HDF5 file contains the number of observed UMI-unique transcripts per gene, sorted by their Ensembl gene ID as provided in the GTF file. Finally, the text file ending in ".stats.tsv" contains the processing statistics:
index01.CELseq_BC_001.counts.hdf5 total_reads 724
index01.CELseq_BC_001.counts.hdf5 valid_reads 73
index01.CELseq_BC_001.counts.hdf5 ambiguous_reads 0
index01.CELseq_BC_001.counts.hdf5 no_feature_reads 247
index01.CELseq_BC_001.counts.hdf5 unmapped_reads 333
index01.CELseq_BC_001.counts.hdf5 lowmapq_reads 71
index01.CELseq_BC_001.counts.hdf5 unique_counts 67
scDamAndTools is licensed under MIT. The detailed license terms are in the LICENSE file.
Original research paper describing scDam&T-seq:
Elaborate description of protocol and analysis: