CCBR / CHAMPAGNE

CHromAtin iMmuno PrecipitAtion sequencinG aNalysis pipEline
https://ccbr.github.io/CHAMPAGNE/
MIT License

Consider consensus peak overlap procedure #159

Open kelly-sovacool opened 11 months ago

kelly-sovacool commented 11 months ago

During consensus peak calling, consider using the method from Corces et al. (doi:10.1126/science.aav1898) to handle overlapping peaks, rather than bedtools merge. From pp. 6-7 of the supplement:

Peak calling for 796 ATAC-seq profiles and 23 cancer types was performed to ensure high quality fixed-width peaks. We chose to use fixed-width peaks because (i) it makes count based and motif focused analyses less biased to large peaks and (ii) with large datasets merging peak sets to obtain a union peak set can lead to many peaks being merged into one very large peak, limiting our ability to resolve independent peaks. Because each cancer type is not represented by an equal number of samples, we first determined a peak set for each cancer type individually. Initially, performing peak calling with MACS2, we found that peak calls were affected by changes in data quality (TSS enrichment scores ranged from 3.94 to 19 in our dataset) and read depth (range 26 million to 258 million per replicate). To overcome this issue, we designed a peak calling procedure that would produce a set of high confidence peaks. For each sample, peak calling was performed on the Tn5-corrected single-base insertions using the MACS2 callpeak command with parameters “--shift -75 --extsize 150 --nomodel --call-summits --nolambda --keep-dup all -p 0.01”. The peak summits were then extended by 250 bp on either side to a final width of 501 bp, filtered by the ENCODE hg38 blacklist (https://www.encodeproject.org/annotations/ENCSR636HFF/), and filtered to remove peaks that extend beyond the ends of chromosomes.
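(Not part of the supplement.) The fixed-width step described above could be sketched roughly as follows. This is only an illustration: the function name `extend_summits`, the `Peak` tuple, and the 0-based half-open coordinate convention are assumptions made for the example, and blacklist filtering is left out for brevity.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Peak(NamedTuple):
    """Hypothetical container for a fixed-width peak (BED-style coordinates)."""
    chrom: str
    start: int    # 0-based, inclusive
    end: int      # 0-based, exclusive
    score: float  # significance reported by the peak caller


def extend_summits(
    summits: Iterable[Tuple[str, int, float]],
    chrom_sizes: Dict[str, int],
    flank: int = 250,
) -> List[Peak]:
    """Extend each summit by `flank` bp on either side (2 * flank + 1 bp wide).

    `summits` holds (chrom, 0-based summit position, score) tuples.
    Peaks that would extend beyond either end of the chromosome are dropped.
    """
    peaks = []
    for chrom, pos, score in summits:
        start = pos - flank
        end = pos + flank + 1  # exclusive end, so width = 2 * flank + 1
        if start < 0 or end > chrom_sizes[chrom]:
            continue  # extends past a chromosome end
        peaks.append(Peak(chrom, start, end, score))
    return peaks


# Toy example: only the first summit is far enough from both chromosome ends.
example = extend_summits(
    [("chr1", 1000, 50.0), ("chr1", 100, 20.0), ("chr1", 9900, 10.0)],
    {"chr1": 10000},
)
```

With the default `flank=250`, every retained peak is 501 bp wide, matching the width quoted above.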

Overlapping peaks called within a single sample were handled using an iterative removal procedure. First, the most significant peak is kept and any peak that directly overlaps with that significant peak is removed. Then, this process iterates to the next most significant peak and so on until all peaks have either been kept or removed due to direct overlap with a more significant peak. This prevents the removal of peaks due to “daisy chaining” or indirect overlap and simultaneously maintains a compendium of fixed-width peaks. This resulted in a set of fixed-width peaks for each sample which we refer to here as a “sample peak set”.
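(Not part of the supplement.) A minimal sketch of the iterative removal procedure as I read it, to make the difference from bedtools merge concrete. The function name `remove_overlaps` and the tuple layout are hypothetical, a higher score is assumed to mean a more significant peak, and the quadratic overlap check is for clarity rather than speed.

```python
from collections import defaultdict
from typing import List, Tuple

# (chrom, start, end, score) with 0-based half-open coordinates (assumed layout)
PeakTuple = Tuple[str, int, int, float]


def remove_overlaps(peaks: List[PeakTuple]) -> List[PeakTuple]:
    """Keep the most significant peak, drop anything directly overlapping it,
    then repeat with the next most significant remaining peak."""
    kept_intervals = defaultdict(list)  # chrom -> [(start, end), ...] already kept
    kept = []
    # Visit peaks from most to least significant.
    for peak in sorted(peaks, key=lambda p: p[3], reverse=True):
        chrom, start, end, _score = peak
        overlaps_kept = any(
            start < k_end and k_start < end
            for k_start, k_end in kept_intervals[chrom]
        )
        if overlaps_kept:
            continue  # directly overlaps a more significant peak
        kept_intervals[chrom].append((start, end))
        kept.append(peak)
    # Return in genomic order for convenience.
    return sorted(kept, key=lambda p: (p[0], p[1]))


# Daisy-chain example: B overlaps both A and C, but A and C do not overlap.
a = ("chr1", 100, 601, 30.0)
b = ("chr1", 400, 901, 10.0)
c = ("chr1", 700, 1201, 20.0)
result = remove_overlaps([a, b, c])
```

In this toy case A and C are both kept and B is dropped, whereas merging the three intervals would collapse them into a single region spanning 100 to 1201.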

kopardev commented 10 months ago

Check this

kopardev commented 4 days ago

@kelly-sovacool we can call these "corces.consensus" peaks and output them whenever replicates are available. What is your opinion?