Closed tacazares closed 2 years ago
There are differences in how the pipeline currently normalizes the ATAC-seq signal. It is not a major problem, but should be corrected for consistency.
We normalize our data to reads per 20 million mapped reads (RP20M). This normalization was chosen based on the median sequencing depth of our data in 2019-2020. Our CWL pipeline, however, uses 1,000,000 reads as the scaling factor (i.e., RPM). We min-max normalize our data, so this is not a big priority, but some data sets have been processed with a different normalization factor.
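As a quick sketch of the difference, the per-read weight under each scheme is just `target / mapped_reads`. The function name and default target below are illustrative, not maxATAC's actual API:

```python
def scaling_factor(mapped_reads: int, target: int = 20_000_000) -> float:
    """Per-read weight so coverage is expressed as reads per `target`
    mapped reads (RP20M with the default target, RPM with 1,000,000).
    Illustrative helper, not the pipeline's actual implementation."""
    if mapped_reads <= 0:
        raise ValueError("mapped_reads must be positive")
    return target / mapped_reads
```

For a library with 40 million mapped reads, the RP20M weight is 0.5 while the RPM weight is 0.025, so tracks produced under the two schemes differ by a constant factor of 20.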
Currently, the user will need to make sure they also have the package `bedGraphToBigWig` in order to use the prepare function.
We have a function that uses Python `subprocess` to execute a script for converting a `.bam` file to a `.bed` file of reads. The script then shifts the reads, removes blacklisted regions, slops the cut sites by 20 base pairs, and generates an RPM-normalized coverage track as a `.bg` file. The `.bg` file is then converted to a `.bigwig` using `bedGraphToBigWig`.
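The steps above could be sketched as a list of shell argument vectors, each runnable via `subprocess.run(cmd, check=True, stdout=...)`. All file names and the exact `bedtools` flags here are assumptions for illustration, and the read-shifting step is omitted since it is script-specific:

```python
def bam_to_bigwig_commands(bam, blacklist_bed, chrom_sizes, prefix,
                           slop=20, scale=1.0):
    """Sketch of the BAM -> bigWig steps as argument lists.
    Hypothetical file naming; the real maxATAC script may differ."""
    bed = f"{prefix}.bed"
    bg = f"{prefix}.bg"
    bw = f"{prefix}.bigwig"
    return [
        # BAM -> BED of reads (read shifting omitted here)
        ["bedtools", "bamtobed", "-i", bam],
        # remove reads overlapping blacklisted regions
        ["bedtools", "intersect", "-v", "-a", bed, "-b", blacklist_bed],
        # slop the cut sites by `slop` base pairs
        ["bedtools", "slop", "-i", bed, "-g", chrom_sizes, "-b", str(slop)],
        # scaled genome coverage as a bedGraph
        ["bedtools", "genomecov", "-i", bed, "-g", chrom_sizes,
         "-bg", "-scale", str(scale)],
        # bedGraph -> bigWig
        ["bedGraphToBigWig", bg, chrom_sizes, bw],
    ]
```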
We want to update the pipeline to include the PCR deduplication and read filtering steps. We do not want to assume the user has performed preprocessing of the BAM file.
Just a thought: does our code check whether deduplication needs to be done before performing removal of duplicate reads? I'm imagining we could check for the occurrence of duplicates and only remove them if we find any.
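One way to sketch that check is to parse `samtools flagstat` output and look at the duplicates line (e.g. `5 + 0 duplicates`). Note this only detects duplicates that have already been *marked* (by `samtools markdup` or Picard), so a real check might need a marking pass first; the parser below is an assumption about the flagstat text layout:

```python
def has_marked_duplicates(flagstat_text: str) -> bool:
    """Return True if `samtools flagstat` output reports any reads
    flagged as duplicates. Assumes the classic line format
    'N + M duplicates'; only sees duplicates already marked."""
    for line in flagstat_text.splitlines():
        if "duplicates" in line and "optical" not in line:
            count = int(line.split()[0])
            if count > 0:
                return True
    return False
```

If this returns `True`, the pipeline could run its deduplication step; otherwise it could skip straight to read filtering.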
We anticipate that some users will want to process their raw FASTQ files for predictions. This will require them to:
We currently have three approaches to performing these tasks. We need to double-check that they all produce the same output.
At minimum, Emily wants our users to be able to process data from a BAM file for prediction within maxATAC. This will require us to implement the latter part of the ATAC-seq data processing in our code with something like `maxatac process -i {input_bam} -blacklist {blacklist.bed}`. We could require the user to provide BAM files that have PCR duplicates removed, which would make the processing easier on our end. We will need to document which packages must be available on the user's PATH if we rely on command-line utilities like `bedGraphToBigWig`.
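If we do depend on external binaries, a command like `maxatac process` could fail fast with a clear message by checking the PATH up front with `shutil.which`. The default tool list here is an assumption about our external dependencies, not the definitive set:

```python
import shutil

def missing_tools(required=("bedtools", "samtools", "bedGraphToBigWig")):
    """Return the required command-line tools that are not found on the
    user's PATH. The default list is illustrative, not definitive."""
    return [tool for tool in required if shutil.which(tool) is None]
```

Calling this at the start of processing lets us report all missing dependencies at once instead of erroring midway through the pipeline.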