Profiling of transcription factor binding sites in cell-free DNA
Raw sequencing data is trimmed to contain bases 53-113 (i.e. the central 60bp in a hypothetical 166bp fragment), PCR duplicates are removed and trimmed read is aligned.
In a second step, coverage data around transcription factor binding sites are averaged and the activity for each transcription factor is calculated.
Transcription factor binding sites are taken from GTRD database (http://gtrd.biouml.org/)
Software needed:
Binaries should be put in ./Software/
FastQ files are trimmed stripping away anything between base 53-113. This should still contain the most central 60bp of a hypothetical 166 bp fragment and thus be the portion of the fragment that is closest to the nucleosome dyad.
usage: trim_and_align.py [-h] -fq FASTQ_FILE -s NAME [-o OUTDIR] [-k] [-t THREADS] [-step START_STEP]
Single-end aligned BAM files can also be diretly manipulated uing trim_from_bam_single_end.py
This is useful when the FastQ files are short than 113 bp.
However, in order to increase speed the BAM loses a lot of details: CIGAR strings, sequence is replaced with all "A"s (in order not to
look up the reference sequence)
The trimmed BAM is still good for coverage analyses, though.
usage: trim_from_bam_single_end.py [-h] -i IN_FILE -o OUT_FILE -l LENGTH
This is used in order to correct for local coverage differences due to copy-number alterations. The output of this is directly fed into the subsequent program, where coverage One option to do this is Plasma-Seq (https://github.com/PeterUlz/PlasmaSeq). Other options could include: ichorCNA (https://github.com/broadinstitute/ichorCNA) or qDNAseq (https://bioconductor.org/packages/release/bioc/html/QDNAseq.html)
However, make sure the input file for the following steps looks like this:
`