KindLab / scDamAndTools

A set of functions for the processing of raw data generated by scDam&T-seq.
MIT License
3 stars 0 forks source link

scDamAndTools

scDamAndTools is a set of functions for the processing of raw data generated by scDam&T-seq, a method to simultaneously measure protein-DNA interactions and transcription from single cells (Rooijers et al., 2019). scDam&T-seq combines a DamID-based method to measure protein-DNA interactions and an adaptation of CEL-Seq to measure transcription. The starting point of the workflow is raw sequencing data and the end result are tables of UMI-unique DamID and CEL-Seq counts.

Table of contents

  1. Installation
  2. Available functions
  3. Example usage
  4. License
  5. References

Installation

scDamAndTools was developed in Python 3.6.5 and is compatible with Python 3.6+.

Setting up a virtual environment

We recommend users to install scDamAndTools and its requirements in a Python virtual environment. In order to generate a virtual environment:

python3 -m venv $HOME/.venvs/my_venv

To activate the virtual environment:

source ~/.venvs/my_venv/bin/activate

Requirements

Install necessary packages:

pip install --upgrade pip wheel setuptools
pip install cython

Installing scDamAndTools

pip install git+https://github.com/KindLab/scDamAndTools.git

Available Functions

scDamAndTools contains eleven functions for the processing of raw sequencing data generated by scDam&T-seq experiments:

function description
add_read_prefix.awk Appends additional bases (e.g. "GA") to the beginning of reads.
bin_damid_counts.py Bins UMI-unique DamID counts into genomically equal-sized bins.
create_motif_arrays Wrapper function that combines all necessary steps to generate motif position and mappability arrays.
demultiplex.py Demultiplexes all reads in a library based on their barcode.
fetch_regions.py Generates in silico reads of a defined read length for all motif occurrences.
find_motif_occurrences.py Finds motif (e.g. "GATC") occurrences genome-wide from a FASTA file.
generate_celseq_counts.py Processes aligned CEL-Seq reads into a table of UMI-unique counts.
generate_damid_counts.py Processes aligned DamID reads into a table of UMI-unique counts.
process_celseq_reads Wrapper function that first alignes the CELseq reads and subsequently calls generate_celseq_counts.py.
process_damid_reads Wrapper function that first alignes the DamID reads and subsequently calls generate_damid_counts.py.
write_posarray.py Generates a table with the positions of a motif genome-wide.

add_read_prefix.awk

add_read_prefix.awk -v PREFIX="GA" infile

Takes a FASTQ file as an input and appends the specified prefix to the beginning of all reads. The output in FASTQ format and is directed to STDOUT.

arguments option description
infile required FASTQ file with sequence reads. Alternatively, reads can be piped via STDIN.
-v PREFIX= required Sequence to be appended to the beginning of all reads.

bin_damid_counts.py

bin_damid_counts.py [-h] [--verbose] [--quiet] --outfile OUTFILE [--binsize INT] --posfile HDF5_FILE [--mapfile HDF5_FILE] infiles [infiles ...]

Takes one or more position-wise UMI-unique DamID count tables (HDF5) as an input and outputs a binned count table with binsizes of the specified size. All input files are summed in the final output file.

arguments option description
infiles required HDF5 file(s) of position-wise UMI-unique DamID count table(s).
--outfile required HDF5 file name to write output to.
--binsize optional Size of the genomic bins into which data will be summarized [default: 2000].
--posfile required Location of the motif position array (HDF5).
--mapfile required Location of the motif mappability array (HDF5).
-v/--verbose optional Set level of logging messages.
-q/--quiet optional Silence all logging messages.

create_motif_arrays

create_motif_arrays [-h] [-m MOTIF] -r READLENGTH -x HISAT2_INDEX -o OUTPREFIX fastafile

A wrapper function that combines all necessary steps to generate a motif position and mappability array. It takes the FASTA file of the relevant genome as input and generates a BED file and HDF5 file indicating the genome-wide positions of the motif (e.g. "GATC") and generates a HDF5 file that indicates whether each motif occurrence can be aligned based on the expected readlength.

arguments option description
fastafile required FASTA file of the relevant genomic reference.
-m optional The sequence motif targeted by the used restriction enzyme [default: GATC].
-o required The prefix (including path) for the position and mappability files.
-r required The expected readlength of the sequencing reads (specifically R1), excluding barcode and UMI.
-x required HISAT2 reference index.

demultiplex.py

demultiplex.py [-h] [--verbose] [--quiet] -o OUTFMT [-m MISMATCHES] [-k] [--verify-no-ambiguous] [--unmatched-outfile FILE] [--ambiguous-outfile FILE] [--infofile FILE] bcfile infiles [infiles ...]

Takes one or more raw sequencing files (FASTQ) as an input and outputs a file (FASTQ) for each DamID/CEL-Seq barcode provided in the barcode file containing all reads matching that barcode.

arguments option description
bcfile required TAB-separated file with DamID and CEL-Seq barcode names and barcode sequences corresponding to the samples in the library.
infiles required FASTQ files of raw sequencing reads.
-o/--outfmt required File name format for output files with demultiplexed reads per sample. Should contain a "{name}" field for the barcode name (and a {readname} field for PE input).
-m/--mismatches optional The maximum number of mismatches allowed between the provided barcodes and observed sequences [default: 1].
-k/--keep optional If set, the barcode sequences are not trimmed from the reads [default: False].
--verify-no-ambiguous optional If set, verifies whether two or more provided barcode sequences are inherently ambiguous given the number of allowed mismatches [default: False].
--unmatched-outfile optional Output file (BAM) for reads in the library that could not be matched to any barcode. In the case of PE input, should contain a "{readname}" field.
--ambiguous-outfile optional Output file (BAM) for reads in the library that match multiple barcodes. In the case of PE input, should contain a "{readname}" field.
--infofile optional File in which to report demultiplexing stats.
-v/--verbose optional Set level of logging messages.
-q/--quiet optional Silence all logging messages.

fetch_regions.py

fetch_regions.py [-h] -l LENGTH [--verbose] [--quiet] bedfile fastafile

Takes a file (BED) of genome-wide motif occurrences and a reference genome sequence (FASTA) as input and generates in silico reads based on motif occurrences and desired read length.

arguments option description
bedfile required BED file containing the motif occurrences genome-wide.
fastafile required FASTA file containing the reference genome sequence.
-l/--length required Length (in bp) of the in silico reads to be generated.
-v/--verbose optional Set level of logging messages.
-q/--quiet optional Silence all logging messages.

find_motif_occurrences.py

find_motif_occurrences.py [-h] filename motif

Takes a (gzipped) genome reference file (FASTA) as input and finds all positions at which the provided motif occurs. Outputs the found positions in BED format to STDOUT. Note, does not handle revcomp well. But that doesn't matter if your motif is itself palindromic.

arguments option description
filename required Genome reference file name (FASTA).
motif required Motif to be found (e.g. "GATC").

generate_celseq_counts.py

generate_celseq_counts.py [-h] [--verbose] [--quiet] --gtf GTF [--mode {intersection-strict,intersection-nonempty,union}] [--min-mapq MIN_MAPQ] [--outfile HDF5_FILE] bamfile [bamfile ...]

Takes an alignment file (BAM) as input and generates a table of UMI-unique transcript counts in HDF5 format, containing the number of observed UMI-unique transcripts per gene, sorted by their ENSEMBL ID as provided in the GTF file. Whether an alignment is considered to overlap a feature is determined by the choice of overlap mode. The available modes are modeled after the HTseq modes.

arguments option description
bamfile required Input BAM file(s) of aligned CEL-Seq reads.
-g/--gtf required Reference GTF file.
-m/--mode required Overlap mode for aligning reads to features. Based on HTseq. Options: {intersection-strict, intersection-nonempty, union}.
--min-mapq optional Minimum mapping quality of aligned reads. Reads below the threshold are discarded [default: 0].
-o/--outfile optional Output file (HDF5) to store UMI-unique transcript counts. If not provided, output is send to STDOUT as a table of geneid and associated counts.
--save-stats optional If set, a text file containing the CEL-Seq counting statistics will be generated [default: False].
-v/--verbose optional Set level of logging messages.
-q/--quiet optional Silence all logging messages.

generate_damid_counts.py

generate_damid_counts.py [-h] [--verbose] [--quiet] --outfile OUTFILE [--invalid-outfile INVALID_OUTFILE] [--min-mapq MIN_MAPQ] [--umi-length UMI_LENGTH] [--keep-n KEEP_N] [--min-editdistance MIN_EDITDISTANCE] --pos-file HDF5_FILE infile

Takes an alignment file (BAM) as input and generates a table of UMI-unique GATC counts (HDF5).

arguments option description
infile required BAM input file of aligned DamID reads sorted by genomic position. Use "-" when reads are piped from STDIN.
--pos-file required Motif position file (HDF5).
--outfile required Output file (HDF5) to which UMI-unique counts are written.
--invalid-outfile optional Output file (BAM) to store invalid reads (i.e. too low mapping quality or not aligning to a GATC).
--min-mapq optional Minimum mapping quality of aligned reads. Reads below the threshold are discarded [default: 0].
--umi-present optional If set, available UMI information will be used to remove PCR duplicates [default: False].
--keep-n optional Maximum number of unique UMIs to allow per position [default: keep all].
--min-editdistance optional Minimum edit distance between pairs of UMIs that are observed per position [default: 1].
--save-stats optional If set, a text file containing the DamID counting statistics will be generated [default: False].
-v/--verbose optional Set level of logging messages.
-q/--quiet optional Silence all logging messages.

process_celseq_reads

process_celseq_reads [-h] -o OUTPREFIX -g GTF -x HISAT2_INDEX fastqfile

Wrapper script that aligns CEL-Seq reads (using HISAT2) and subsequently calls generate_celseq_counts.py. The function takes CEL-Seq reads (FASTQ) as input and outputs a alignment file (BAM) and count file (HDF5), containing the number of observed UMI-unique transcripts per gene, sorted by their ENSEMBL ID as provided in the GTF file.

arguments option description
fastqfile required FASTQ file containing CEL-Seq reads.
-o required The prefix (including path) for the generated alignment and count file.
-g required GTF file for the relevant genome reference.
-x required HISAT2 index for the relevant genome reference.
-q optional Minimum mapping quality to retain read [default: 10].

process_damid_reads

process_damid_reads [-h] [-m PREFIX] [-q MIN_MAPQ] [-u] [-k KEEP_N] [-d MIN_EDITDISTANCE] -o OUTPREFIX -p POSARRAY -x HISAT2_INDEX fastqfile

Wrapper script that aligns DamID reads (using HISAT2) and subsequently calls generate_damid_counts.py. The function takes DamID reads (FASTQ) as input and outputs a alignment file (BAM) and count file (HDF5), containing the number of observed (UMI-unique) counts per motif occurrence.

arguments option description
fastqfile required FASTQ file containing DamID reads.
-o required The prefix (including path) for the generated alignment and count file.
-p required Motif position file for the relevant genome reference.
-x required HISAT2 index for the relevant genome reference.
-m optional Motif prefix to append to reads [default: GA].
-q optional Minimum mapping quality of aligned reads. Reads below the threshold are discarded [default: 0].
-u optional If set, available UMI information will be used to remove PCR duplicates [default: False].
-k optional Maximum number of unique UMIs to allow per position [default: 4].
-d optional Minimum edit distance between pairs of UMIs that are observed per position [default: 2].

write_posarray.py

write_posarray.py [-h] [-l ELEMENT_LENGTH] --outfile OUTFILE pos_file

Takes a BED file describing all motif occurrences and generates a table in HDF5 format with all motif positions genome-wide. For each chromosome, the first entry is the position of the first motif occurrence on the plus strand, subsequent values represent the distance to the next motif. Only positions on the plus strand are indicated, motif positions on the minus strand can be inferred.

arguments option description
pos_file required BED input file with motif positions.
--outfile required Output file name (HDF5) for posarray.
-l/--element-length optional Length of the motif in bp (e.g. 4 for the GATC motif) [default: infer length from input].

Example usage

The analysis steps in this and subsequent sections demonstrate how scDam&T-seq data can be analysed using the provided software package (scDamAndTools). File names and genome references are chosen to match the test data available as part of the package. Care should be taken to modify these when applying the analysis on other data. In some parts of the workflow, we offer the option to use a wrapper script to combine multiple processing steps. Although these wrapper scripts ease the workflow, they are less customizable. When applying scDamAndTools in a custom workflow, the individual steps offer more flexibility.

Software requirements

Many of the functions provided in the scDamAndTools package are Python3-based and can be easily incorporated in a custom workflow. An exception this are the wrapper functions "create_motif_refarrays", "process_celseq_reads" and "process_damid_reads", which are bash scripts. The workflow as explained below was developed using the following environment and tools:

Tutorial data

We have included a test dataset set as part of this repository. In the following steps, we will demonstrate how the scDam&Tools package can be implemented to analyse this data. The data comes from a mouse Embryonic Stem Cell (mESC) line expressing Dam-LaminB1. As such, we expect that the resulting DamID profiles show Lamina-Associated Domains (LADs). The raw sequencing data contains downsampled DamID and CEL-Seq reads of five single cells.

Downloading genome reference files and generating HISAT2 index

hisat2_extract_splice_sites.py $GTFFN > $SSFN hisat2_extract_exons.py $GTFFN > $EXONFN hisat2-build --ss $SSFN --exon $EXONFN $FASTAFN $HISAT2_INDEX


### Generate GATC reference arrays
To efficiently match obtained DamID reads to specific instances of the GATC motif in the genome, we generate two reference arrays. The first array (“position array” or “posarray”) contains the positions of all GATC positions in the genome. The second array (“mappability array” or “maparray”) indicates whether it is possible to uniquely align a read derived from a particular (strand-specific) GATC instance. The mappability array is used to filter out ambiguously aligning GATCs and can serve as an indicator of the (mappable) GATC density along the chromosomes. During the generation of the mappability array, in silico reads are generated for each GATC instance and are subsequently mapped back to the reference genome. The length of the reads should be chosen to be the same as the length of the reads obtained in the experiment (excluding the adapter).

#### Option 1: Using the wrapper script
- *Step 6a.1:* Generate the motif position and mappability arrays:

IN_SILICO_READLENGTH=62 ARRAY_PREFIX="./refarrays/Mus_musculus.GRCm38.dna.primary_assembly" create_motif_refarrays \ -m "GATC" \ -o $ARRAY_PREFIX \ -r $IN_SILICO_READLENGTH \ -x $HISAT2_INDEX \ $FASTAFN


#### Option 2: Performing the individual steps
- *Step 6b.1:* Find “GATC” occurrences in genome:

mkdir ./refarrays/ find_motif_occurrences.py \ $FASTAFN \ "GATC" \ | awk '$1 !~ /^ERCC/' \ | gzip > ./refarrays/Mus_musculus.GRCm38.dna.primary_assembly.GATC.positions.bed.gz

GATC positions in the ERCC sequences are discarded, since these are not present in the genomic DNA of the cells.

- *Step 6b.2:* Generate an HDF5 file with all GATC positions in the genome:

write_posarray.py \ --outfile ./refarrays/Mus_musculus.GRCm38.dna.primary_assembly.GATC.posarray.hdf5 \ ./refarrays/Mus_musculus.GRCm38.dna.primary_assembly.GATC.positions.bed.gz


- *Step 6b.3:* Generate _in silico_ reads mapping to GATC positions and align these back to the reference to determine the mappability of the positions:

IN_SILICO_READLENGTH=62; fetch_regions.py \ -q \ -l $IN_SILICO_READLENGTH \ ./refarrays/Mus_musculus.GRCm38.dna.primary_assembly.GATC.positions.bed.gz \ $FASTAFN \ | hisat2 \ --seed 42 \ -f \ -x $HISAT2_INDEX \ -U - \ --no-spliced-alignment \ --mp '2,0' \ --sp '4,0' \ | samtools view -ub - \ | samtools sort -m 500M -l 9 \

./refarrays/Mus_musculus.GRCm38.dna.primary_assembly.GATC.readlength_62.sorted.bam

Expected output

At the end of these steps, four files should have been generated, ending in ending in “.positions.bed.gz” (all occurrences of the GATC motif in BED format), “.posarray.hdf5” (all occurrences of the GATC motif as a HDF5 array), ".sorted.bam" (alignment of the in silico generated reads in BAM format) and “.maparray.hdf5” (the mappability of all GATC motifs as a HDF5 array). In case the wrapper script was used, the alignment file (BAM) has been automatically deleted.

Demultiplexing the raw sequencing data

In this step, the raw sequencing reads are split into multiple files based on their barcodes. For each barcode, a separate file will be made. Reads that do not match any barcode or that match multiple barcodes can be stored in separated files.

NOTE: There may be multiple raw sequencing files pertaining to the same samples, for example from the different sequencing lanes. These files should be concatenated, latest prior to the sorting of aligned reads.

Expected output

The demultiplex script generates a separate FASTQ file for each barcode provided in the barcode information file (see step 8) that contains all reads matching this barcode. In addition, a text file (“index01.demultiplex_info.txt”) is generated that details the number of reads matched to each barcode:

ambiguous       0
matching_mismatch       0
matching_perfect        450000
unmatching      0
CELseq_BC_001   724     0
CELseq_BC_002   11087   0
CELseq_BC_003   12096   0
CELseq_BC_004   14603   0
CELseq_BC_005   9696    0
DamID_BC_001    102140  0
DamID_BC_002    31597   0
DamID_BC_003    111963  0
DamID_BC_004    46903   0
DamID_BC_005    109191  0

In this case, there are no reads ambiguous or unmatched barcodes, since the test dataset represents a downsampled version of several samples that have previously been selected on their barcode. In a typical experiment, 5-10% of reads cannot be unambiguously matched to any barcode.

Processing the DamID reads to counts

The subsequent steps need to be performed on all DamID demultiplexed files. It is highly recommended that this process be parallelized on a high-performance computing cluster. The amount of time necessary for these steps depends entirely on the number of libraries, samples per library and available computing cores.

Option 1: Using the wrapper script

Option 2: Performing the individual steps

Expected output

After running the above steps, the following files have been generated:

index01.DamID_BC_001.counts.binsize_100000.hdf5
index01.DamID_BC_001.counts.hdf5
index01.DamID_BC_001.counts.stats.tsv
index01.DamID_BC_001.sorted.bam

The BAM file contains the aligned DamID reads, the HDF5 file ending in ".counts.hdf5" contains the number of unique fragments observed per strand-specific GATC positions, the HDF5 file ending in ".binsize_100000.hdf5" contains the number of observed unique fragments per 100kb bin. Finally, the text file ending in ".stats.tsv" contains the processing statistics:

index01.DamID_BC_001.counts.hdf5    total_reads     102140
index01.DamID_BC_001.counts.hdf5    unmapped_reads  20280
index01.DamID_BC_001.counts.hdf5    lowmapq_reads   6705
index01.DamID_BC_001.counts.hdf5    nongatc_reads   3876
index01.DamID_BC_001.counts.hdf5    invalidumi_reads        3
index01.DamID_BC_001.counts.hdf5    valid_reads     71276
index01.DamID_BC_001.counts.hdf5    unique_counts   53612

Plotting the binned counts observed over chromosome 19 yields the following enrichment profiles: Expected DamID profiles
These enrichment profiles are in line with the output we expect for a Dam-LaminB1 cell line, albeit somewhat sparse.

Processing the CEL-Seq reads to counts

The subsequent steps need to be performed on all CEL-Seq demultiplexed files. It is highly recommended that this process be parallelized on a high-performance computing cluster. The amount of time necessary for these steps depends entirely on the number of libraries, samples per library and available computing cores.

Option 1: Using the wrapper script

Option 2: Performing the individual steps

Expected output

After running the above steps, the following files have been generated:

index01.CELseq_BC_001.bam
index01.CELseq_BC_001.counts.hdf5
index01.CELseq_BC_001.counts.stats.tsv

The BAM file contains the aligned CEL-Seq reads, while the HDF5 file contains the number of observed UMI-unique transcripts per gene, sorted by their Ensembl gene ID as provided in the GTF file. Finally, the text file ending in ".stats.tsv" contains the processing statistics:

index01.CELseq_BC_001.counts.hdf5    total_reads     724
index01.CELseq_BC_001.counts.hdf5    valid_reads     73
index01.CELseq_BC_001.counts.hdf5    ambiguous_reads 0
index01.CELseq_BC_001.counts.hdf5    no_feature_reads        247
index01.CELseq_BC_001.counts.hdf5    unmapped_reads  333
index01.CELseq_BC_001.counts.hdf5    lowmapq_reads   71
index01.CELseq_BC_001.counts.hdf5    unique_counts   67

License

scDamAndTools is licensed under MIT. The detailed license terms are in the LICENSE file.

References

Original research paper describing scDam&T-seq:

Elaborate description of protocol and analysis: