Open jashapiro opened 4 years ago
Some simple statistics comparing the ENCODE blacklist to our current blacklist file: the ENCODE blacklist consists of 71570285 bp in 910 regions.
I compared the ENCODE blacklist to the components of our current blacklist using `bedtools jaccard`:
file | intersection | union | jaccard | n_intersections |
---|---|---|---|---|
centromeres.bed | 65782885 | 98687400 | 0.666578 | 59 |
heterochromatin.bed | 2229051 | 205668649 | 0.0108381 | 43 |
immunoglobulin_regions.bed | 1174 | 78705587 | 1.49163e-05 | 2 |
segmental_dups.bed | 2459981 | 209957022 | 0.0117166 | 173 |
telomeres.bed | 166447 | 96361119 | 0.00172733 | 10 |
It appears that the ENCODE blacklist covers a good fraction of the centromeres (though not all), but other components are largely independent.
Including the ENCODE blacklist would add 2908285 bp to our current blacklist.
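For anyone who wants to sanity-check these numbers without `bedtools`, the columns in the table can be approximated in plain Python. This is only a minimal sketch, not the command that produced the table: it assumes BED-style half-open `(chrom, start, end)` records already loaded in memory, and the function names (`merge`, `by_chrom`, `jaccard`) are made up for illustration. Counts such as `n_intersections` may differ slightly from `bedtools` in edge cases (for example, bookended intervals are merged here).

```python
from collections import defaultdict


def merge(intervals):
    """Merge overlapping or bookended half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def by_chrom(records):
    """Group (chrom, start, end) records by chromosome and merge each group."""
    grouped = defaultdict(list)
    for chrom, start, end in records:
        grouped[chrom].append((start, end))
    return {chrom: merge(ivs) for chrom, ivs in grouped.items()}


def jaccard(a_records, b_records):
    """Return base-pair intersection, union, and Jaccard index of two region sets."""
    a, b = by_chrom(a_records), by_chrom(b_records)
    a_bp = sum(end - start for ivs in a.values() for start, end in ivs)
    b_bp = sum(end - start for ivs in b.values() for start, end in ivs)

    inter_bp = 0
    n_inter = 0
    for chrom in a.keys() & b.keys():
        ia, ib = a[chrom], b[chrom]
        i = j = 0
        # Sweep both sorted, merged interval lists in parallel.
        while i < len(ia) and j < len(ib):
            lo = max(ia[i][0], ib[j][0])
            hi = min(ia[i][1], ib[j][1])
            if lo < hi:
                inter_bp += hi - lo
                n_inter += 1
            # Advance whichever interval ends first.
            if ia[i][1] < ib[j][1]:
                i += 1
            else:
                j += 1

    union_bp = a_bp + b_bp - inter_bp
    return {
        "a_bp": a_bp,
        "b_bp": b_bp,
        "intersection": inter_bp,
        "union": union_bp,
        "jaccard": inter_bp / union_bp if union_bp else 0.0,
        "n_intersections": n_inter,
    }
```

With `a_records` as the ENCODE blacklist and `b_records` as the merged current blacklist, `a_bp - intersection` gives the bases that only the ENCODE list covers, which is the same kind of quantity as the "newly added" figure above.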
In our analysis of copy number variation, we currently use a manually curated set of regions for which copy number variant analysis tends to be inaccurate, but this list is not tied to a clear public resource that is easily referenced and downloadable. The current list also seems to miss many false-positive-prone regions, as indicated by strong banding patterns independent of sample type in figures such as this heatmap of CN variation.
The ENCODE blacklist may provide an alternative resource for identifying regions which may be problematic in CNV analysis as well as other analyses where mismapping can lead to error.
A previous version of this blacklist and a discussion of the situations where its use is recommended are published here:
Amemiya, H.M., Kundaje, A. & Boyle, A.P. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep 9, 9354 (2019). https://doi.org/10.1038/s41598-019-45839-z
An updated version of the blacklist has just been released at the following location: https://www.encodeproject.org/files/ENCFF356LFX/
What changes need to be made? Please provide enough detail for another participant to make the update.
The ENCODE blacklist should be compared to our current CNV exclusion lists in the `copy_number_consensus_call` module. If there is sufficient overlap with the current set of excluded regions, we may be able to simply replace the current exclusion regions with the ENCODE blacklist. Otherwise, we may consider adding the ENCODE blacklist to the list of excluded regions. Results should be checked to see if apparent false positives are reduced, while preserving known signals.
Other analyses, including SNVs, should also be checked against the ENCODE blacklist to avoid potential false positive signals.
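As a rough illustration of what checking SNVs against the blacklist could look like, here is a minimal sketch. It assumes the blacklist has been read into BED-style half-open `(chrom, start, end)` tuples and that variant positions are 1-based, as in VCF/MAF files. The helper names (`build_index`, `is_blacklisted`) are hypothetical and not part of any existing module.

```python
from bisect import bisect_right


def build_index(regions):
    """Build per-chromosome sorted, merged start/end lists from BED-style regions."""
    merged = {}
    for chrom, start, end in sorted(regions):
        ivs = merged.setdefault(chrom, [])
        if ivs and start <= ivs[-1][1]:
            # Overlapping or bookended region: extend the previous interval.
            ivs[-1] = (ivs[-1][0], max(ivs[-1][1], end))
        else:
            ivs.append((start, end))
    return {
        chrom: ([s for s, _ in ivs], [e for _, e in ivs])
        for chrom, ivs in merged.items()
    }


def is_blacklisted(index, chrom, pos):
    """Return True if a 1-based variant position falls inside a blacklisted region."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    pos0 = pos - 1  # convert the 1-based position to 0-based BED coordinates
    k = bisect_right(starts, pos0) - 1  # last region starting at or before pos0
    return k >= 0 and pos0 < ends[k]
```

Variants flagged this way could be annotated rather than dropped outright, so that known signals inside blacklisted regions can still be reviewed.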