MiraldiLab / maxATAC

Transcription Factor Binding Prediction from ATAC-seq and scATAC-seq with Deep Neural Networks
Apache License 2.0

Interesting observation with predictions: Blacklist update #90

Closed by tacazares 2 years ago

tacazares commented 2 years ago

I was testing CTCF prediction in GM12878 on chr1 and noticed several regions of repeating, uniformly distributed predictions near the centromeres. This could indicate problematic repeat regions that consistently cause issues. It might also be related to the repeats that David Kelley blacklisted in his PLOS Computational Biology paper.

[Screenshot: Screen Shot 2022-02-16 at 9 24 38 PM]

tacazares commented 2 years ago

I noticed a discrepancy between our ENCODE blacklist and the blacklist available from the Boyle Lab. The Boyle Lab blacklist accounts for some of the segmental duplication regions, whereas the Anshul ENCODE Blacklist, which is the basis for our current blacklist, does not include them. I will generate statistics and update our current blacklist in the next few days. The files that will be used for the new blacklist can be found here: maxATAC Blacklist Files.
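
For reference, combining several BED sources into one blacklist amounts to taking the union of their intervals. The snippet below is only a minimal sketch of that idea in plain Python, not the procedure actually used for the maxATAC blacklist. The function name `merge_intervals` and the toy coordinates are hypothetical, and intervals are assumed to be half-open `(chrom, start, end)` tuples as in BED.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Return the union of (chrom, start, end) intervals.

    Overlapping or directly adjacent intervals on the same chromosome
    are collapsed into one. Coordinates are treated as half-open, as in BED.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                # First interval on this chromosome.
                current_start, current_end = start, end
            elif start <= current_end:
                # Overlaps or touches the running interval: extend it.
                current_end = max(current_end, end)
            else:
                # Gap found: emit the running interval and start a new one.
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged


# Toy intervals standing in for two blacklist sources (made-up coordinates).
source_a = [("chr1", 100, 200), ("chr1", 400, 500)]
source_b = [("chr1", 150, 300), ("chr2", 0, 10)]
combined = merge_intervals(source_a + source_b)
```

On real files, a dedicated tool such as `bedtools merge` on the sorted, concatenated BED files is the more usual route; the sketch only shows the logic.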

tacazares commented 2 years ago

Coverage differs depending on the source of the data. Our maxATAC V2 blacklist has the most coverage because it includes the high-signal regions identified by Boyle in addition to the other genomic features that we curated. In total, the V2 blacklist marks ~8.3% of the hg38 genome as unreliable signal regions. Interval counts and coverage for each file:

ENCFF356LFX.bed:
     Number of Intervals: 910
     Total bps covered: 71,570,285
     Percent hg38 covered: 2.17%
hg38-blacklist.v2.bed:
     Number of Intervals: 636
     Total bps covered: 227,162,400
     Percent hg38 covered: 6.886%
hg38_centromeres.bed:
     Number of Intervals: 109
     Total bps covered: 59,546,786
     Percent hg38 covered: 1.805%
hg38_gaps.bed:
     Number of Intervals: 827
     Total bps covered: 161,348,343
     Percent hg38 covered: 4.891%
hg38_maxatac_blacklist.bed:
     Number of Intervals: 376
     Total bps covered: 217,585,970
     Percent hg38 covered: 6.596%
hg38_maxatac_blacklist_V2.bed:
     Number of Intervals: 1667
     Total bps covered: 275,198,132
     Percent hg38 covered: 8.342%
hg38_segmental_dups_chrM.bed:
     Number of Intervals: 12
     Total bps covered: 36,418
     Percent hg38 covered: 0.001%
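
Statistics of this shape can be reproduced with a few lines of Python. This is a rough sketch rather than the script that generated the numbers above: `blacklist_stats` and `format_stats` are hypothetical names, the intervals are assumed to be non-overlapping half-open `(chrom, start, end)` tuples, and the genome size must be supplied by the caller because the denominator used for the percentages is not stated in this thread.

```python
def blacklist_stats(intervals, genome_size):
    """Summarize a set of non-overlapping (chrom, start, end) intervals.

    Returns (number of intervals, total bp covered, percent of genome).
    Overlapping intervals would be double counted, so merge them first.
    """
    n_intervals = len(intervals)
    total_bp = sum(end - start for _, start, end in intervals)
    percent = round(100 * total_bp / genome_size, 3)
    return n_intervals, total_bp, percent


def format_stats(name, intervals, genome_size):
    """Render the summary in the same layout as the listing above."""
    n_intervals, total_bp, percent = blacklist_stats(intervals, genome_size)
    return (
        f"{name}:\n"
        f"     Number of Intervals: {n_intervals}\n"
        f"     Total bps covered: {total_bp:,}\n"
        f"     Percent hg38 covered: {percent}%"
    )


# Toy example: two intervals on a made-up 1,000 bp genome.
toy = [("chr1", 0, 100), ("chr1", 200, 250)]
stats = blacklist_stats(toy, genome_size=1000)
report = format_stats("toy.bed", toy, genome_size=1000)
```

For a real BED file, the intervals would be read from the first three tab-separated columns of each line, and `genome_size` would be the total length of the hg38 sequences being considered.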