AlexsLemonade / OpenPBTA-analysis

The analysis repository for the Open Pediatric Brain Tumor Atlas Project

Discussion: Employ ENCODE blacklist to exclude problematic genomic regions #705

Open jashapiro opened 4 years ago

jashapiro commented 4 years ago

In our analysis of copy number variation, we currently use a manually curated set of regions where copy number variant calls tend to be inaccurate, but this list is not tied to a clear public resource that is easily referenced and downloadable. The current list also seems to miss many false-positive-prone regions, as indicated by strong banding patterns that appear regardless of sample type in figures such as this heatmap of CN variation.

The ENCODE blacklist may provide an alternative resource for identifying regions which may be problematic in CNV analysis as well as other analyses where mismapping can lead to error.

A previous version of this blacklist, along with a discussion of the situations where its use is recommended, is published here:

Amemiya, H.M., Kundaje, A. & Boyle, A.P. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep 9, 9354 (2019). https://doi.org/10.1038/s41598-019-45839-z

An updated version of the blacklist has just been released at the following location: https://www.encodeproject.org/files/ENCFF356LFX/

What changes need to be made? Please provide enough detail for another participant to make the update.

The ENCODE blacklist should be compared to our current CNV exclusion lists in the copy_number_consensus_call module. If there is sufficient overlap with the current set of excluded regions, we may be able to simply replace the current exclusion regions with the ENCODE blacklist. Otherwise, we may consider adding the ENCODE blacklist to the list of excluded regions.

Results should be checked to see if apparent false positives are reduced, while preserving known signals.

Other analyses, including SNV calling, should also be checked against the ENCODE blacklist to avoid potential false positive signals.
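As a rough illustration of what checking SNVs against the blacklist could look like, here is a minimal sketch that flags variant positions falling inside blacklisted regions. The function names (`build_index`, `in_blacklist`) and the toy regions are hypothetical and not part of any existing module; the sketch assumes the blacklist is in BED coordinates (0-based, half-open) and the variant positions are 1-based, as in VCF.

```python
from bisect import bisect_right

def build_index(regions):
    """Merge overlapping intervals per chromosome and return sorted lookups.

    regions: dict mapping chromosome -> list of (start, end) BED intervals
    (0-based, half-open). Hypothetical helper for illustration only.
    """
    index = {}
    for chrom, intervals in regions.items():
        merged = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index

def in_blacklist(index, chrom, pos_1based):
    """Return True if a 1-based position lies inside a blacklisted interval."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    pos0 = pos_1based - 1  # convert the 1-based position to 0-based BED coordinates
    i = bisect_right(starts, pos0) - 1  # last interval starting at or before pos0
    return i >= 0 and pos0 < ends[i]

# Toy regions, not real blacklist entries.
blacklist = build_index({"chr1": [(100, 200), (150, 250)]})
flagged = [p for p in (100, 101, 250, 251) if in_blacklist(blacklist, "chr1", p)]
```

Whether flagged variants should be dropped or only annotated is a separate decision; the sketch only identifies them.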

jashapiro commented 4 years ago

Some simple statistics comparing the ENCODE blacklist to our current blacklist file: the ENCODE blacklist consists of 71570285 bp in 910 regions.

I compared the ENCODE blacklist to the components of our current blacklist using bedtools jaccard:

| file | intersection (bp) | union (bp) | jaccard | n_intersections |
|---|---|---|---|---|
| centromeres.bed | 65782885 | 98687400 | 0.666578 | 59 |
| heterochromatin.bed | 2229051 | 205668649 | 0.0108381 | 43 |
| immunoglobulin_regions.bed | 1174 | 78705587 | 1.49163e-05 | 2 |
| segmental_dups.bed | 2459981 | 209957022 | 0.0117166 | 173 |
| telomeres.bed | 166447 | 96361119 | 0.00172733 | 10 |

It appears that the ENCODE blacklist covers a good fraction of the centromeres (though not all), but other components are largely independent.
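For anyone who wants to see what the intersection, union, and jaccard columns mean, here is a minimal pure-Python sketch of the same statistic on toy intervals. It is not how bedtools is implemented and it does not compute n_intersections. The function names and example regions are hypothetical, and it assumes BED-style half-open intervals keyed by chromosome.

```python
from collections import defaultdict

def merge(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def covered_bp(by_chrom):
    """Total number of bases covered, counting overlapping bases once."""
    return sum(end - start
               for intervals in by_chrom.values()
               for start, end in merge(intervals))

def jaccard(a, b):
    """Return (intersection_bp, union_bp, jaccard) for two region sets.

    a, b: dict mapping chromosome -> list of (start, end). Hypothetical helper.
    """
    combined = defaultdict(list)
    for regions in (a, b):
        for chrom, intervals in regions.items():
            combined[chrom].extend(intervals)
    union = covered_bp(combined)
    # Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
    intersection = covered_bp(a) + covered_bp(b) - union
    ratio = intersection / union if union else 0.0
    return intersection, union, ratio

# Toy regions, not real data.
set_a = {"chr1": [(0, 100), (50, 150)]}
set_b = {"chr1": [(100, 200)], "chr2": [(0, 10)]}
intersection, union, ratio = jaccard(set_a, set_b)
```

The same ratio can be read off the table: for centromeres.bed, 65782885 / 98687400 is approximately 0.6666.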

Including the ENCODE blacklist would add 2908285 bp to our blacklist that are not already covered.
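A number like this can be obtained as the bases covered by the current lists plus the candidate list, minus the bases covered by the current lists alone. The following is a minimal sketch of that calculation on toy intervals, not the commands actually used for the figure above; `newly_added_bp` and the example regions are hypothetical.

```python
from collections import defaultdict

def covered_bp(by_chrom):
    """Total bases covered by half-open intervals, counting overlaps once."""
    total = 0
    for intervals in by_chrom.values():
        current_end = None
        for start, end in sorted(intervals):
            if current_end is None or start > current_end:
                total += end - start          # new, non-overlapping block
                current_end = end
            elif end > current_end:
                total += end - current_end    # only the part extending past the block
                current_end = end
    return total

def newly_added_bp(current_lists, candidate):
    """Bases in `candidate` that no list in `current_lists` already covers.

    current_lists: list of dicts mapping chromosome -> list of (start, end).
    candidate: dict of the same shape. Hypothetical helper for illustration.
    """
    current = defaultdict(list)
    for regions in current_lists:
        for chrom, intervals in regions.items():
            current[chrom].extend(intervals)
    with_candidate = defaultdict(list, {c: list(v) for c, v in current.items()})
    for chrom, intervals in candidate.items():
        with_candidate[chrom].extend(intervals)
    return covered_bp(with_candidate) - covered_bp(current)

# Toy regions, not real data.
current_lists = [{"chr1": [(0, 100)]}, {"chr1": [(80, 120)], "chr2": [(0, 50)]}]
candidate = {"chr1": [(90, 150)], "chr3": [(10, 20)]}
added = newly_added_bp(current_lists, candidate)
```

In the toy example, only chr1:120-150 and chr3:10-20 are new, so `added` is 40.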