tacazares closed this issue 2 years ago
I noticed a discrepancy between our ENCODE blacklist and the blacklist available from the Boyle Lab. The Boyle Lab blacklist accounts for some of the segmental duplication regions, whereas the Anshul ENCODE Blacklist, which is the basis for our current blacklist, does not include them. I will generate statistics and update our current blacklist in the next few days. The files that will be used for the new blacklist can be found here: maxATAC Blacklist Files.
Coverage differs depending on the source of the data. Our maxATAC V2 blacklist has the most coverage because it includes the high-signal regions identified by Boyle in addition to the other genomic features that we curated. In the final blacklist, ~8.3% of the hg38 genome is flagged as having unreliable signal.
ENCFF356LFX.bed:
Number of Intervals: 910
Total bps covered: 71,570,285
Percent hg38 covered: 2.17%
hg38-blacklist.v2.bed:
Number of Intervals: 636
Total bps covered: 227,162,400
Percent hg38 covered: 6.886%
hg38_centromeres.bed:
Number of Intervals: 109
Total bps covered: 59,546,786
Percent hg38 covered: 1.805%
hg38_gaps.bed:
Number of Intervals: 827
Total bps covered: 161,348,343
Percent hg38 covered: 4.891%
hg38_maxatac_blacklist.bed:
Number of Intervals: 376
Total bps covered: 217,585,970
Percent hg38 covered: 6.596%
hg38_maxatac_blacklist_V2.bed:
Number of Intervals: 1667
Total bps covered: 275,198,132
Percent hg38 covered: 8.342%
hg38_segmental_dups_chrM.bed:
Number of Intervals: 12
Total bps covered: 36,418
Percent hg38 covered: 0.001%
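For reference, statistics of this kind can be computed with a few lines of Python. This is only a minimal sketch, not necessarily how the numbers above were produced. The function names (`parse_bed`, `merge_intervals`, `coverage_stats`) are hypothetical, the genome size is passed in by the caller rather than assumed, and the sketch assumes that overlapping intervals should be merged before counting base pairs so that no position is counted twice.

```python
from collections import defaultdict


def parse_bed(lines):
    """Yield (chrom, start, end) from BED-formatted lines, skipping headers."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        yield chrom, int(start), int(end)


def merge_intervals(intervals):
    """Merge overlapping or bookended intervals on each chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        cur_start, cur_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if cur_start is None:
                cur_start, cur_end = start, end
            elif start <= cur_end:
                # Overlaps (or touches) the current interval: extend it.
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        if cur_start is not None:
            merged.append((chrom, cur_start, cur_end))
    return merged


def coverage_stats(intervals, genome_size):
    """Return the raw interval count, merged bp covered, and percent of genome.

    genome_size is supplied by the caller (assumption: total length of the
    reference being used as the denominator).
    """
    intervals = list(intervals)
    merged = merge_intervals(intervals)
    total_bp = sum(end - start for _, start, end in merged)
    return {
        "n_intervals": len(intervals),
        "total_bp": total_bp,
        "percent": 100.0 * total_bp / genome_size,
    }


# Toy example with made-up coordinates and a made-up genome size.
toy_lines = ["chr1\t0\t100", "chr1\t50\t150", "chr2\t10\t20"]
toy_stats = coverage_stats(parse_bed(toy_lines), genome_size=1000)
print(toy_stats)
```

The same `merge_intervals` helper could also be used to take the union of several source files (for example gaps, centromeres, and high-signal regions) by concatenating their parsed intervals before merging.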
I was testing CTCF prediction in GM12878 on chr1 and noticed several regions of repeating, uniformly distributed predictions near the centromeres. This could indicate problematic repeat regions that consistently cause issues. It might also be related to the repeats that David Kelley blacklisted in his PLOS Computational Biology paper.