databio / pepatac

A modular, containerized pipeline for ATAC-seq data processing
http://pepatac.databio.org
BSD 2-Clause "Simplified" License
54 stars 15 forks source link

The count table with a PEPATAC-produced consensus peak set is incorrect. #274

Closed zhongzheng1999 closed 6 months ago

zhongzheng1999 commented 9 months ago

I attempted to manually compress ref_peak.bed into ref_peak.bed.gz using gzip. Subsequently, when running looper runp, I successfully obtained a count table(#273). But I found out that this table could be wrong!

image

By examining the PEPATAC_comands.sh, I noticed that after obtaining the peaks_coverage.bed file through bedtools coverage, you use awk to append a column as the 8th column in the file. This column primarily involves normalizing the 5th column of the bed file.

bedtools coverage -sorted -a peaks_sort.bed -b sort_dedup.bam -g chr_order.txt | uniq > peaks_coverage.bed
awk 'BEGIN{OFS="\t";} NR==FNR{sum += $5; next}{if (sum <= 0){sum = 1} print $1, $2, $3, $4, $5, $6, $7, ($5/sum*1000000)}' peaks_coverage.bed peaks_coverage.bed > norm_peaks_coverage.bed

I think you might be attempting to normalize the read counts, enabling the comparison of peaks across different samples. However, a crucial point to note is that the bed file format obtained from bedtools coverage is: "chr", "start", "end", "read_count", "base_count", "width", "frac". In simpler terms, the read counts are in the 4th column, but you are standardizing the 5th column, which represents base counts.