BoostDM saturation VEP files have three main problems:
Tabix systematically skips the first position of cds_25bp.regions.tsv.gz.
We are including entire exons that consist only of non-coding sequence.
Tackling the problem:
[x] We want to fix the tabix check
[x] We want to define a coding region that takes only CDS regions and adds 5 bp at the beginning and at the end
BoostDM dataset --> cds-5spli.regions.gz
The BioMart query we use in IntOGen returns genomic coordinates that will replace the exon coordinates we use in the BoostDM regions. This aligns the region definition between IntOGen and BoostDM.
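The region definition described above (CDS only, plus 5 bp on each side) can be sketched as follows. This is a minimal illustration, not the code from the IntOGen or BoostDM pipelines: the function name `pad_and_merge`, the tuple layout, the 1-based inclusive coordinates, and the merging of overlapping padded intervals are all assumptions made for the example.

```python
from typing import List, Tuple

# (chromosome, start, end); assumed 1-based, inclusive coordinates
Interval = Tuple[str, int, int]


def pad_and_merge(cds: List[Interval], flank: int = 5) -> List[Interval]:
    """Extend each CDS interval by `flank` bp on both sides and merge overlaps."""
    # Pad every interval, never going below position 1
    padded = sorted(
        (chrom, max(1, start - flank), end + flank) for chrom, start, end in cds
    )
    merged: List[Interval] = []
    for chrom, start, end in padded:
        # Merge with the previous interval if they overlap or are adjacent
        if merged and merged[-1][0] == chrom and start <= merged[-1][2] + 1:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


if __name__ == "__main__":
    # Hypothetical CDS segments, for illustration only
    example = [("1", 100, 200), ("1", 208, 300), ("2", 3, 10)]
    for region in pad_and_merge(example):
        print(*region, sep="\t")
```

Merging is included here so that two CDS segments closer than 10 bp do not produce duplicated positions; whether the real regions file merges them is not stated in these notes.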
We then redefined the splicing region to be 5 bp instead of 25.
commit: https://github.com/bbglab/intogen-plus/commit/dc3c9cc974549cec8970f1dfda6878fd42c7e0a8
DriverSaturation step
A new run of the saturation step was done, fixing the first-position error.
commit: https://github.com/bbglab/intogen-plus/commit/9580b1a437030ce35cb712c5020b8b027b8b93dc
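A check for the first-position error could look like the sketch below. It does not call tabix and does not reproduce the fix in the commit above. It only compares the positions present in a saturation output against the positions expected from the regions file. The function names, the tuple layout, and the 1-based inclusive coordinates are assumptions for illustration; the notes do not state the cause of the skipped position, and the "buggy" enumeration in the usage example is simulated.

```python
from typing import Iterable, List, Set, Tuple

# (chromosome, start, end); assumed 1-based, inclusive coordinates
Region = Tuple[str, int, int]
Position = Tuple[str, int]


def expected_positions(regions: Iterable[Region]) -> Set[Position]:
    """Every position a region should contribute, first and last included."""
    return {
        (chrom, pos)
        for chrom, start, end in regions
        for pos in range(start, end + 1)
    }


def missing_first_positions(
    regions: Iterable[Region], observed: Set[Position]
) -> List[Region]:
    """Regions whose first position is absent from the observed positions."""
    return [r for r in regions if (r[0], r[1]) not in observed]


if __name__ == "__main__":
    regions = [("1", 10, 12), ("2", 5, 6)]
    # Simulated off-by-one: enumeration starts one base too late
    buggy = {(c, p) for c, s, e in regions for p in range(s + 1, e + 1)}
    print(missing_first_positions(regions, buggy))
    print(missing_first_positions(regions, expected_positions(regions)))
```

Running a check like this on the new saturation output should return an empty list if no region is missing its first position.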