MiraldiLab / maxATAC

Transcription Factor Binding Prediction from ATAC-seq and scATAC-seq with Deep Neural Networks
Apache License 2.0

Write a maxATAC function to prepare ATAC-seq data from a BAM file and review main pipelines #67

Closed tacazares closed 2 years ago

tacazares commented 2 years ago

We anticipate that some users will want to process their raw FASTQ files for predictions, which will require several preprocessing steps.

We currently have three approaches to performing these tasks. We need to verify that all three produce the same results.

At a minimum, Emily wants our users to be able to process data from a BAM file for prediction within maxATAC. This will require us to implement the latter part of the ATAC-seq data processing in our code with something like `maxatac process -i {input_bam} -blacklist {blacklist.bed}`. We could require the user to provide BAM files that already have PCR duplicates removed, which would make the processing easier on our end. We will need to document which packages must be available on the user's PATH if we rely on command-line utilities like `bedGraphToBigWig`.
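A minimal sketch of what the argument parser for the proposed `maxatac process` subcommand could look like. Only `-i` and `-blacklist` appear in the command above; the long-option aliases and the `--dedup` flag are illustrative assumptions, not a final interface.

```python
import argparse

def build_process_parser():
    """Sketch of an argument parser for a `maxatac process` subcommand.

    Only -i and -blacklist come from the proposed command; everything
    else here is an assumption for illustration.
    """
    parser = argparse.ArgumentParser(prog="maxatac process")
    parser.add_argument("-i", "--input", required=True,
                        help="Input BAM file of aligned ATAC-seq reads")
    parser.add_argument("-blacklist", required=True,
                        help="BED file of blacklisted regions to exclude")
    parser.add_argument("--dedup", action="store_true",
                        help="Remove PCR duplicates before generating signal "
                             "(hypothetical option)")
    return parser

# Example invocation mirroring the command proposed above:
args = build_process_parser().parse_args(
    ["-i", "sample.bam", "-blacklist", "hg38_blacklist.bed"])
```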

tacazares commented 2 years ago

There are differences in how the pipeline currently normalizes the ATAC-seq signal. It is not a major problem, but should be corrected for consistency.

We normalize our data to reads per 20 million mapped reads (RP20M). This normalization was chosen somewhat arbitrarily, based on the median sequencing depth of our data in 2019-2020. Our CWL pipeline instead uses 1,000,000 as the scaling factor (reads per million, RPM). We min-max normalize our data downstream, so this is not a high priority, but some data sets have been processed with a different normalization factor.

tacazares commented 2 years ago

Currently, the user will need to make sure they also have the `bedGraphToBigWig` utility installed in order to use the `prepare` function.
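One way to surface this requirement early is to check the PATH before shelling out. A small sketch; `bedGraphToBigWig` is named in this thread, while `samtools` and `bedtools` are assumed companions:

```python
import shutil

# External tools the prepare function may shell out to. bedGraphToBigWig
# is the one named in this issue; samtools/bedtools are assumptions.
REQUIRED_TOOLS = ("bedGraphToBigWig", "samtools", "bedtools")

def missing_external_tools(tools=REQUIRED_TOOLS):
    """Return the required command-line tools not found on the user's PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

With this, `prepare` could fail fast with a clear message listing the missing tools instead of dying later with a cryptic subprocess error.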

tacazares commented 2 years ago

We have a function that uses Python subprocess to execute a script converting a .bam file to a .bed file of reads. The script then shifts the reads, removes blacklisted regions, slops the cut sites by 20 base pairs, and generates an RPM-normalized coverage track as a .bg file. The .bg file is then converted to a .bigwig with bedGraphToBigWig.
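The steps above can be sketched as a sequence of shell commands. The flags reflect standard bedtools/UCSC usage and the intermediate file names are placeholders; this is an illustration of the described flow, not maxATAC's actual script:

```python
import shlex

def build_prepare_pipeline(bam, blacklist, genome, scale, prefix):
    """Assemble the BAM -> bigWig steps described above as shell commands.

    `genome` is a chrom.sizes file; `scale` is the normalization factor.
    Intermediate names and exact flags are assumptions for illustration.
    """
    bg, bw = f"{prefix}.bg", f"{prefix}.bw"
    return [
        # BAM -> BED of reads (the Tn5 +4/-5 read shift would be
        # applied to this BED, e.g. with an awk one-liner)
        f"bedtools bamtobed -i {shlex.quote(bam)}",
        # drop reads overlapping blacklisted regions
        f"bedtools intersect -v -a reads.bed -b {shlex.quote(blacklist)}",
        # widen each cut site by 20 bp on both sides
        f"bedtools slop -i filtered.bed -g {shlex.quote(genome)} -b 20",
        # scaled (RPM) coverage track as bedGraph
        f"bedtools genomecov -i slopped.bed -g {shlex.quote(genome)} -bg -scale {scale}",
        # bedGraph -> bigWig
        f"bedGraphToBigWig {bg} {shlex.quote(genome)} {bw}",
    ]
```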

We want to update the pipeline to include the PCR deduplication and read filtering steps. We do not want to assume the user has performed preprocessing of the BAM file.
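The upstream filtering and deduplication steps could look like the following. The flag value (`-F 1804` drops unmapped, secondary, QC-fail, and duplicate-marked reads) and the `samtools markdup` route are one common approach, not necessarily what maxATAC will adopt:

```python
def build_bam_preprocess_cmds(bam, out_bam, min_mapq=30):
    """Sketch of the proposed upstream BAM steps: quality/flag filtering,
    then PCR-duplicate marking and removal via samtools markdup.

    Intermediate file names and the MAPQ threshold are illustrative
    assumptions, not maxATAC's final choices.
    """
    return [
        # keep well-mapped primary reads only
        f"samtools view -b -q {min_mapq} -F 1804 -o filtered.bam {bam}",
        # markdup requires name-sorted input for fixmate ...
        "samtools sort -n -o namesorted.bam filtered.bam",
        "samtools fixmate -m namesorted.bam fixmate.bam",
        # ... then coordinate-sorted input for duplicate marking
        "samtools sort -o possorted.bam fixmate.bam",
        # -r removes the duplicates rather than just marking them
        f"samtools markdup -r possorted.bam {out_bam}",
    ]
```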

emiraldi commented 2 years ago

> We have a function for using python subprocess to execute a script for converting a .bam file to a .bed file of reads. The script will then shift the reads, remove the blacklist, slop the cut sites by 20 base pairs, and then generate a RPM normalized coverage track as a .bg file. The .bg file is then converted to a .bigwig using bedGraphToBigWig.
>
> We want to update the pipeline to include the PCR deduplication and read filtering steps. We do not want to assume the user has performed preprocessing of the BAM file.

Just a thought: does our code check whether deduplication needs to be done before performing removal of duplicate reads? I'm imagining we could check for the occurrence of duplicates and, if we find them, remove them.
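One cheap check along these lines is to parse `samtools flagstat` output for duplicate-flagged reads. A sketch, with the caveat that a zero count is ambiguous: it can mean either no duplicates exist or that the BAM was never run through a duplicate marker, so marking may still be needed:

```python
def duplicates_flagged(flagstat_text: str) -> bool:
    """Parse `samtools flagstat` output and report whether any reads
    carry the PCR-duplicate flag.

    A True result means marked duplicates are present and removal is
    warranted; a zero count does NOT prove the library is duplicate-free,
    since the BAM may simply never have been duplicate-marked.
    """
    for line in flagstat_text.splitlines():
        if "duplicates" in line:
            count = int(line.split()[0])
            if count > 0:
                return True
    return False

# Illustrative flagstat excerpt (made-up numbers):
example = ("1000 + 0 in total (QC-passed reads + QC-failed reads)\n"
           "120 + 0 duplicates")
print(duplicates_flagged(example))  # → True
```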