Bioconductor / GenomeInfoDb

Utilities for manipulating chromosome names, including modifying them to follow a particular naming style
https://bioconductor.org/packages/GenomeInfoDb

dynamic Seqinfo lookup for hg19 failing #9

Closed lawremi closed 4 years ago

lawremi commented 4 years ago

The HelloRanges package is failing in devel and release because the dynamic resolution of sequence information from UCSC is broken for hg19:

Seqinfo(genome = "hg19")
## Error in FUN(genome = names(SUPPORTED_UCSC_GENOMES)[idx], circ_seqs =
## supported_genome$circ_seqs,  : 
##  cannot map the following UCSC seqlevel(s) to an NCBI seqlevel:
##  chr1_jh636052_fix, chrX_jh806600_fix, chrX_jh806587_fix,
##  chr7_jh159134_fix, chrX_jh159150_fix, chrX_jh806590_fix,
##  chr10_jh591181_fix, chr1_jh636053_fix, chr5_gl339449_alt,
##  chr14_kb021645_fix, chrX_jh720453_fix, chrX_jh806601_fix,
##  chr7_gl582971_fix, chrX_jh806599_fix, chr19_gl949749_alt,
##  chr19_gl949750_alt, chr19_gl949748_alt, chr19_kb021647_fix,
##  chrX_jh806597_fix, chr10_ke332501_fix, chr19_gl949751_alt,
##  chr19_gl949746_alt, chr19_gl949752_alt, chrX_jh806598_fix,
##  chrX_jh720451_fix, chrX_jh806591_fix, chr11_jh806581_fix,
##  chrX_jh806588_fix, chrX_jh806592_fix, chr19_gl949753_alt,
##  chr1_jh636054_fix, chrX_jh720454_fix, chr19_gl949747_alt,
##  chr7_jh636058_fix, chrX_jh806602_fix, chr17_gl383561_fix,
##  chr8_gl949743_fix, chr2_kb663603_fix, chr19_gl582977_fix,
##  chr19_ke332505_fix, chr11_jh159140_fix, chr5_ke332497_fix,
##  chr17_gl383560_fix, chrX_jh720452_fix, chr4_ke332496_fix,
##  chr6_kb663604_fix, chr

I wonder if it would be better to include a static copy of this information, at least for the most commonly accessed genomes? That would be more stable, faster, and more reproducible.

hpages commented 4 years ago

Should be fixed in GenomeInfoDb 1.23.14 (see commit 573077ce).

A static copy would be faster but would go out of sync whenever the genome changes, as happened here. So we would need some mechanism to alert us when this happens, e.g. some long tests (they're run once a week) that pull the sequence information again from UCSC and compare it to the static copy.
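A rough sketch of what such a comparison might look like, using base R only. The helper name `compare_seqinfo` and the two-column data frame layout (`seqnames`, `seqlengths`) are assumptions made for illustration, not anything in GenomeInfoDb; in a real long test the "fresh" table would presumably be derived from something like `Seqinfo(genome = "hg19")`, and the static table from a file shipped with the package.

```r
# Hypothetical helper (not part of GenomeInfoDb): report how a freshly
# pulled sequence table differs from a static copy. Both arguments are
# data frames with columns `seqnames` (character) and `seqlengths` (integer).
compare_seqinfo <- function(static, fresh) {
  added   <- setdiff(fresh$seqnames, static$seqnames)   # new in the fresh pull
  removed <- setdiff(static$seqnames, fresh$seqnames)   # gone from the fresh pull
  common  <- intersect(static$seqnames, fresh$seqnames)
  static_len <- static$seqlengths[match(common, static$seqnames)]
  fresh_len  <- fresh$seqlengths[match(common, fresh$seqnames)]
  changed <- common[which(static_len != fresh_len)]     # same name, new length
  list(added = added, removed = removed, changed = changed)
}

# Toy data standing in for the static copy and a fresh pull.
static <- data.frame(seqnames   = c("chr1", "chr2", "chrM"),
                     seqlengths = c(249250621L, 243199373L, 16571L),
                     stringsAsFactors = FALSE)
fresh <- rbind(static,
               data.frame(seqnames   = "chr1_jh636052_fix",
                          seqlengths = 100L,  # placeholder, not the real length
                          stringsAsFactors = FALSE))

delta   <- compare_seqinfo(static, fresh)
in_sync <- all(lengths(delta) == 0L)
if (!in_sync) {
  message("Static copy is out of sync; added: ",
          paste(delta$added, collapse = ", "))
}
```

A weekly test could simply fail when `in_sync` is `FALSE`, which would flag exactly the kind of change that triggered this issue (new `_fix` and `_alt` sequences appearing upstream).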

lawremi commented 4 years ago

Thanks. That the genome can change is problematic from the reproducibility standpoint. We may want to ensure stability, i.e., be purposefully out of sync, and only update once a release or something, with the information tied to the package version.
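One way to picture the "purposefully out of sync" idea is a lookup that always prefers a snapshot shipped with the package and only goes to the network for genomes that have none. This is a sketch under assumptions: the `snapshots` registry, the `get_seqinfo` function, its `fetch` argument, and the version string are all hypothetical and do not reflect how GenomeInfoDb is actually organized.

```r
# Hypothetical registry of static snapshots, each tagged with the package
# version that shipped it. The version string is a made-up placeholder.
snapshots <- list(
  hg19 = list(
    pkg_version = "0.0.1",
    seqinfo = data.frame(seqnames   = c("chr1", "chr2", "chrM"),
                         seqlengths = c(249250621L, 243199373L, 16571L),
                         stringsAsFactors = FALSE)
  )
)

# Hypothetical lookup: return the pinned snapshot when one exists, otherwise
# fall back to a caller-supplied `fetch` function (e.g. a dynamic UCSC query).
get_seqinfo <- function(genome, fetch = NULL) {
  snap <- snapshots[[genome]]
  if (!is.null(snap)) {
    return(snap)  # stable within a release, regardless of upstream changes
  }
  if (is.null(fetch)) {
    stop("no static snapshot for genome '", genome, "' and no fetch function given")
  }
  list(pkg_version = NA_character_, seqinfo = fetch(genome))
}

pinned <- get_seqinfo("hg19")

# A stand-in fetcher, used here only to show the fallback path.
fake_fetch <- function(genome) {
  data.frame(seqnames = "chrA", seqlengths = 1L, stringsAsFactors = FALSE)
}
dynamic <- get_seqinfo("other_genome", fetch = fake_fetch)
```

With this layout, results for a snapshotted genome depend only on the installed package version, and the snapshot would be refreshed deliberately at release time rather than whenever UCSC changes.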