Discussion: Employ ENCODE blacklist to exclude problematic genomic regions

In our copy number variation (CNV) analysis, we currently use a manually curated set of regions where CNV calls tend to be inaccurate, but this list is not tied to a public resource that can be easily referenced and downloaded. The current list also seems to miss many regions prone to false positives, as indicated by strong banding patterns that appear independent of sample type in figures such as the heatmap of CN variation.

The ENCODE blacklist may provide an alternative resource for identifying regions which may be problematic in CNV analysis as well as other analyses where mismapping can lead to error.

A previous version of this blacklist, along with a discussion of the situations where its use is recommended, is published here:

Amemiya, H.M., Kundaje, A. & Boyle, A.P. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep 9, 9354 (2019). https://doi.org/10.1038/s41598-019-45839-z

An updated version of the blacklist has just been released at the following location: https://www.encodeproject.org/files/ENCFF356LFX/

What changes need to be made? Please provide enough detail for another participant to make the update.

The ENCODE blacklist should be compared to our current CNV exclusion lists in the copy_number_consensus_call module. If there is sufficient overlap with the current set of excluded regions, we may be able to simply replace the current exclusion regions with the ENCODE blacklist. Otherwise, we may consider adding the ENCODE blacklist to the list of excluded regions.

Results should be checked to see if apparent false positives are reduced, while preserving known signals.

Other analyses, including SNV calling, should also be checked against the ENCODE blacklist to avoid potential false positive signals.

Some simple statistics comparing the ENCODE blacklist to our current blacklist file:

The ENCODE blacklist consists of 71570285 bp in 910 regions.
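For anyone wanting to reproduce the region and base-pair counts, here is a minimal sketch of how they could be tallied from a BED file. It is not the command I actually ran; `bed_summary` is a hypothetical helper, the toy records are made up, and it assumes tab-separated, 0-based half-open BED records that do not overlap each other.

```python
import io


def bed_summary(handle):
    """Return (number of regions, total bp) for BED-formatted lines.

    Assumes 0-based, half-open coordinates and non-overlapping records,
    so each region contributes end - start bases.
    """
    n_regions = 0
    total_bp = 0
    for line in handle:
        line = line.strip()
        # Skip blank lines and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        n_regions += 1
        total_bp += int(end) - int(start)
    return n_regions, total_bp


# Toy data standing in for a real blacklist file (values are invented).
toy_bed = io.StringIO("chr1\t0\t1000\nchr1\t5000\t5600\nchr2\t100\t250\n")
print(bed_summary(toy_bed))
```

With a real file, the same function could be called on an open file handle in place of the `io.StringIO` object.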

I compared the ENCODE blacklist to the components of our current blacklist using bedtools jaccard:

file	intersection	union	jaccard	n_intersections
centromeres.bed	65782885	98687400	0.666578	59
heterochromatin.bed	2229051	205668649	0.0108381	43
immunoglobulin_regions.bed	1174	78705587	1.49163e-05	2
segmental_dups.bed	2459981	209957022	0.0117166	173
telomeres.bed	166447	96361119	0.00172733	10
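To make the columns above easier to interpret, here is a rough pure-Python sketch of the quantity being reported: intersecting base pairs divided by the base pairs in the union of the two interval sets (consistent with the table, e.g. 65782885 / 98687400 ≈ 0.6666 for centromeres). This is an illustration, not how bedtools is implemented; the function names and toy intervals are assumptions, and coordinates are taken to be 0-based, half-open.

```python
from collections import defaultdict


def merge(intervals):
    """Merge overlapping or abutting (chrom, start, end) intervals per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    merged = {}
    for chrom, spans in by_chrom.items():
        spans.sort()
        out = [list(spans[0])]
        for start, end in spans[1:]:
            if start <= out[-1][1]:
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = [tuple(span) for span in out]
    return merged


def covered_bp(merged):
    """Total bases covered by a merged interval set."""
    return sum(end - start for spans in merged.values() for start, end in spans)


def intersection_bp(a, b):
    """Bases shared by two merged interval sets (two-pointer sweep per chromosome)."""
    total = 0
    for chrom in a.keys() & b.keys():
        xs, ys = a[chrom], b[chrom]
        i = j = 0
        while i < len(xs) and j < len(ys):
            lo = max(xs[i][0], ys[j][0])
            hi = min(xs[i][1], ys[j][1])
            if lo < hi:
                total += hi - lo
            # Advance whichever interval ends first.
            if xs[i][1] < ys[j][1]:
                i += 1
            else:
                j += 1
    return total


def jaccard(a_intervals, b_intervals):
    """Return (intersection bp, union bp, intersection / union)."""
    a, b = merge(a_intervals), merge(b_intervals)
    inter = intersection_bp(a, b)
    union = covered_bp(a) + covered_bp(b) - inter
    return inter, union, (inter / union if union else 0.0)


# Toy intervals (invented), standing in for the ENCODE list and one component file.
encode_toy = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 0, 50)]
current_toy = [("chr1", 250, 400), ("chr3", 10, 20)]
print(jaccard(encode_toy, current_toy))
```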

It appears that the ENCODE blacklist covers a good fraction of the centromeres (though not all), but other components are largely independent.

2908285 bp would be newly added to our blacklist by including the ENCODE blacklist.
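The "newly added" figure corresponds to the bases in the ENCODE blacklist that are not covered by any component of our current blacklist. A minimal sketch of that calculation follows, assuming 0-based half-open intervals; `newly_added_bp` and the toy intervals are hypothetical and only illustrate the idea, not the exact commands used to get the number above.

```python
def merge_spans(spans):
    """Merge overlapping or abutting (start, end) spans on one chromosome."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def newly_added_bp(candidate, existing):
    """Count bases in `candidate` that are not covered by `existing`.

    Both arguments are lists of (chrom, start, end) tuples.
    """
    total = 0
    for chrom in {c for c, _, _ in candidate}:
        cand = merge_spans((s, e) for c, s, e in candidate if c == chrom)
        have = merge_spans((s, e) for c, s, e in existing if c == chrom)
        for start, end in cand:
            pos = start  # first base of this span not yet accounted for
            for h_start, h_end in have:
                if h_end <= pos:
                    continue  # existing span lies entirely before pos
                if h_start >= end:
                    break  # existing span lies entirely after this candidate
                if h_start > pos:
                    total += h_start - pos  # uncovered gap before the overlap
                pos = max(pos, h_end)
                if pos >= end:
                    break
            if pos < end:
                total += end - pos  # uncovered tail of the candidate span
    return total


# Toy intervals (invented): candidate = ENCODE-like list, existing = current list.
candidate_toy = [("chr1", 100, 300), ("chr2", 0, 50)]
existing_toy = [("chr1", 150, 200), ("chr1", 250, 400), ("chr3", 0, 10)]
print(newly_added_bp(candidate_toy, existing_toy))
```

In practice the `existing` list would be the concatenation of all five component files, so that a base counts as new only if none of them covers it.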

AlexsLemonade / OpenPBTA-analysis

Discussion: Employ ENCODE blacklist to exclude problematic genomic regions #705
