JetBrains-Research / span

SPAN Semi-supervised Peak Analyzer
https://doi.org/10.1093/bioinformatics/btab376
MIT License

SPAN ignores chromosome alternative contigs #8

Closed by iromeo 5 years ago

iromeo commented 5 years ago

A BAM file can include contigs for alternative chromosome haplotypes as well as other unplaced or unlocalized contigs, e.g. these entries from the hg19.chrom.sizes file:

chr6_dbb_hap3   4610396
chr17_ctg5_hap1 1680828
chr4_ctg9_hap1  590426
chr1_gl000192_random    547496
chrUn_gl000225  211173
chr4_gl000194_random    191469
chr4_gl000193_random    189789

As far as I understand, SPAN simply ignores reads aligned to such contigs and doesn't call peaks there. IMHO a general-purpose peak caller should call peaks on every contig present in both the BAM file and the chromosome sizes file.

If we are going to fix this, don't forget about https://github.com/JetBrains-Research/epigenome/issues/1168. A user-defined chromosome sizes file can differ slightly from our reference (e.g. some contigs are missing, or it contains extra contigs), and that shouldn't prevent a SPAN model from being loaded into JBR.
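
To illustrate the mismatch described above, here is a minimal Kotlin sketch that parses two chrom.sizes files and reports which contigs are shared, missing, or extra. The names `parseChromSizes`, `ContigDiff`, and `diffContigs` are hypothetical and are not part of SPAN or JBR; the sketch only assumes the usual chrom.sizes layout of one whitespace-separated name and length per line.

```kotlin
// Hypothetical helpers for illustration only; not SPAN or JBR code.

/** Parses chrom.sizes text: one "<name><whitespace><length>" entry per line. */
fun parseChromSizes(text: String): Map<String, Long> =
    text.lineSequence()
        .map { it.trim() }
        .filter { it.isNotEmpty() }
        .associate { line ->
            val fields = line.split(Regex("\\s+"))
            require(fields.size == 2) { "Malformed chrom.sizes line: $line" }
            fields[0] to fields[1].toLong()
        }

/** Contigs present in both files, only in the reference, and only in the user file. */
data class ContigDiff(
    val shared: Set<String>,
    val missing: Set<String>,
    val extra: Set<String>
)

/** Compares contig names only; a difference is reported, never treated as an error. */
fun diffContigs(reference: Map<String, Long>, user: Map<String, Long>): ContigDiff =
    ContigDiff(
        shared = reference.keys intersect user.keys,
        missing = reference.keys - user.keys,
        extra = user.keys - reference.keys
    )
```

A caller could log `missing` and `extra` as warnings and carry on with `shared`, rather than refusing to load the model. Whether that is the right policy for JBR is a design choice this sketch does not settle.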

olegs commented 5 years ago

SPAN ignores them because GenomeQuery, by default, filters chromosomes by name in its get method, using this pattern:

private val MAPPED_CHRS_PATTERN = "chr[0-9a-tv-zA-TV-Z]+[0-9a-zA-Z]*".toRegex()
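
To see how this pattern behaves on names like those listed earlier in the thread, here is a small Kotlin sketch. It assumes the filter requires the whole contig name to match, which the thread does not confirm, and `partitionByPattern` is a hypothetical helper rather than GenomeQuery code.

```kotlin
// Pattern copied verbatim from the comment above.
val MAPPED_CHRS_PATTERN = "chr[0-9a-tv-zA-TV-Z]+[0-9a-zA-Z]*".toRegex()

/**
 * Splits contig names into (matching, non-matching).
 * Assumption: a name is kept only if the pattern matches the entire name.
 */
fun partitionByPattern(names: List<String>): Pair<List<String>, List<String>> =
    names.partition { MAPPED_CHRS_PATTERN.matches(it) }
```

Under that assumption, names containing an underscore (`chr6_dbb_hap3`, `chr1_gl000192_random`) fail because `_` is in neither character class, and `chrUn_gl000225` fails because the first class excludes `u` and `U`. Note that `chrM` does match this pattern even though the log quoted later in the thread lists it as ignored, so the actual filter presumably applies some additional rule that is not shown here.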

olegs commented 5 years ago

As of SPAN-0.8.0.4533, ignored chromosomes are logged in the output:

[Nov 23, 2018 19:46:44] Ignored chromosomes /Users/oleg/work/galaxy/database/jobs_directory/000/25/working/mm10.chrom.sizes: chr5_JH584299_random, chrX_GL456233_random, chrY_JH584301_random, chr1_GL456211_random, chr4_GL456350_random, chr4_JH584293_random, chr1_GL456221_random, chr5_JH584297_random, chr5_JH584296_random, chr5_GL456354_random, chr4_JH584294_random, chr5_JH584298_random, chrY_JH584300_random, chr7_GL456219_random, chr1_GL456210_random, chrY_JH584303_random, chrY_JH584302_random, chr1_GL456212_random, chrUn_JH584304, chrUn_GL456379, chr4_GL456216_random, chrUn_GL456393, chrUn_GL456366, chrUn_GL456367, chrUn_GL456239, chr1_GL456213_random, chrUn_GL456383, chrUn_GL456385, chrUn_GL456360, chrUn_GL456378, chrUn_GL456389, chrUn_GL456372, chrUn_GL456370, chrUn_GL456381, chrUn_GL456387, chrUn_GL456390, chrUn_GL456394, chrUn_GL456392, chrUn_GL456382, chrUn_GL456359, chrUn_GL456396, chrUn_GL456368, chrM, chr4_JH584292_random, chr4_JH584295_random

olegs commented 5 years ago

Fixed as of version 0.10.0