Bioconductor / GenomeInfoDb

Utilities for manipulating chromosome names, including modifying them to follow a particular naming style
https://bioconductor.org/packages/GenomeInfoDb

Caching seqinfo #26

Open jeff-mandell opened 3 years ago

jeff-mandell commented 3 years ago

Hi, my package uses GenomeInfoDb, and we use the seqlevelsStyle function to clean up user-supplied data and ensure consistent chromosome names (in our case, we go with NCBI style, which means stripping chr prefixes). I can see that what seems like a simple task gets complicated under the hood by the need to download the latest info from NCBI, Ensembl, and UCSC.

I found that .UCSC_cached_chrom_info and .NCBI_cached_chrom_info store the information seqlevelsStyle needs for the duration of a session, but an internet connection is still required at the start of every new session. This causes a problem for offline users and for users on networks that, for whatever reason, block NCBI, UCSC, or Ensembl traffic (yes, this is really happening). Since seqinfo is such a small amount of data, is there a plan to take advantage of R's support for caching user data to save this information and allow seqlevelsStyle to run offline? Or is there a safe workaround to supply the necessary seqinfo?

I worked around it as shown below, but I'm concerned this could cause problems with new GenomeInfoDb releases or if anything changes on the NCBI/UCSC/Ensembl server side.

# Get information for local caching (run once, with an internet connection)
library(BSgenome)  # provides getBSgenome(); the hg19 BSgenome data package must be installed
bsg = getBSgenome("hg19")
# Setting the style triggers the NCBI/UCSC downloads and fills the session caches
seqlevelsStyle(bsg) = "NCBI"
ucsc_info = getFromNamespace(".UCSC_cached_chrom_info", "GenomeInfoDb")[["hg19"]]
ucsc_info = GenomeInfoDb:::.add_ensembl_column(ucsc_info, "hg19")
ncbi_info = getFromNamespace(".NCBI_cached_chrom_info", "GenomeInfoDb")[["GCF_000001405.25"]]
saveRDS(ncbi_info, "hg19_ncbi_seqinfo_for_GenomeInfoDb.rds")
saveRDS(ucsc_info, "hg19_ucsc_seqinfo_for_GenomeInfoDb.rds")

# Later, in new (offline) R session
ucsc_info = readRDS("hg19_ucsc_seqinfo_for_GenomeInfoDb.rds")
ncbi_info = readRDS("hg19_ncbi_seqinfo_for_GenomeInfoDb.rds")
# Put the saved tables back into GenomeInfoDb's internal cache environments
assign('hg19', ucsc_info, envir = get(".UCSC_cached_chrom_info", envir = asNamespace('GenomeInfoDb')))
assign('GCF_000001405.25', ncbi_info, envir = get(".NCBI_cached_chrom_info", envir = asNamespace('GenomeInfoDb')))

# seqlevelsStyle now works offline

hpages commented 3 years ago

There's no plan at the moment to take advantage of R's support for caching user data to save NCBI or UCSC assembly/genome information and allow seqlevelsStyle() to run offline.

One concern with a persistent caching solution is that there's a slight possibility that the information provided by NCBI or UCSC for a given assembly/genome changes in the future. But maybe the risk of this actually happening is so low that we shouldn't be too concerned. It could also be mitigated via an expiration mechanism, e.g. NCBI or UCSC chromosome information gets automatically removed from the persistent cache after a couple of months.
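To make the expiration idea concrete, here is a minimal sketch of an on-disk cache whose entries are dropped once they are older than a given number of days. Nothing here is part of GenomeInfoDb: the helper names (cache_dir, cache_path, cache_put, cache_get), the "mypkg" package name, and the 60-day default are all made up for illustration. It relies only on base R and tools::R_user_dir(), which needs R >= 4.0.0.

```r
# Hypothetical helpers (not GenomeInfoDb API): a tiny persistent cache
# whose entries expire after `max_age_days`.

cache_dir <- function(pkg = "mypkg") {
  # Per-user cache directory; "mypkg" is a placeholder package name
  d <- tools::R_user_dir(pkg, which = "cache")
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
  d
}

cache_path <- function(key, dir = cache_dir()) {
  file.path(dir, paste0(key, ".rds"))
}

cache_put <- function(key, value, dir = cache_dir()) {
  saveRDS(value, cache_path(key, dir))
  invisible(value)
}

cache_get <- function(key, max_age_days = 60, dir = cache_dir()) {
  path <- cache_path(key, dir)
  if (!file.exists(path)) return(NULL)
  age_days <- as.numeric(difftime(Sys.time(), file.mtime(path), units = "days"))
  if (age_days > max_age_days) {
    unlink(path)  # expired: remove the entry so the caller re-fetches it
    return(NULL)
  }
  readRDS(path)
}
```

A caller would try cache_get("hg19") first and only go to the network (followed by cache_put()) when it returns NULL. Using the file modification time as the entry's age keeps the sketch short; a real implementation might prefer to store an explicit timestamp alongside the data.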

Also note that even with a persistent caching solution, an internet connection would still be initially necessary so it doesn't really solve the problem for users on networks that are blocking NCBI/UCSC/Ensembl traffic.

jeff-mandell commented 3 years ago

Thanks for taking the time to respond. I understand the risk of sequence information changing. A persistent caching solution would help users who sometimes work offline, and it would prevent some crashes in HPC environments (e.g., a random node is misconfigured or has network problems). Maybe it's too niche of a need, but it could also help out package developers to be able to insert their own entries into .UCSC_cached_chrom_info and .NCBI_cached_chrom_info for use in these situations. The need for the end user to do simple harmonization of human data (just making chr prefixes and M/MT consistent, without regard for non-primary assembly sequences) is probably pretty widespread.
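For reference, the kind of simple harmonization I mean could be sketched in base R like this. The function name to_ncbi_style is hypothetical (it is not a GenomeInfoDb function), and it deliberately handles only the chr prefix and the M/MT naming of primary human chromosomes; it does not look anything up, so scaffold or alt-contig names would not be mapped to their real NCBI names.

```r
# Hypothetical helper, not part of GenomeInfoDb: strip "chr" prefixes and
# rename the mitochondrial chromosome from "M" to "MT".
to_ncbi_style <- function(chrom) {
  out <- sub("^chr", "", as.character(chrom))
  out[out %in% "M"] <- "MT"  # %in% avoids problems with NA entries
  out
}

to_ncbi_style(c("chr1", "chrX", "chrM", "MT", "22"))
# returns "1" "X" "MT" "MT" "22"
```

This needs no network access, but it is only a stopgap for the common case; seqlevelsStyle remains the right tool whenever non-primary sequences matter.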

hpages commented 3 years ago

Bingo! And just when we were talking about the possibility of UCSC suddenly changing the chromosome information of their genomes, they just do it! See issue #27.

Note that this is not the first time. They already did this last year with hg19 when they decided to base it on GRCh37.p13 instead of GRCh37. This broke many things and created a lot of confusion.

hpages commented 1 year ago

Hi @jeff-mandell,

Just to let you know that I implemented an "offline mode" for getChromInfoFromUCSC(). This is in GenomeInfoDb 1.33.9. See commit 345f22c55b8c431f1cf8080af3235f78266ade9c.

Note that it's only a partial "offline mode" i.e. it works when called with assembled.molecules.only=TRUE and only for a selection of registered genomes. See "Offline mode" in ?getChromInfoFromUCSC for more information.

Cheers, H.

jeff-mandell commented 1 year ago

Thank you, this is nice to have!

nvictus commented 5 months ago

@hpages Are there plans to make "offline" assembly metadata available on AnnotationHub, like the Ensembl and UCSC transcript DBs?

hpages commented 5 months ago

There are plans to make some assembly metadata available offline but there's no clear roadmap yet. In particular whether it's going to be via AnnotationHub or other means has not been decided.

Note that the chrom info for some UCSC genomes is already available offline, e.g. getChromInfoFromUCSC("hg38", assembled.molecules.only=TRUE) and getChromInfoFromUCSC("hs1", assembled.molecules.only=TRUE) both work offline. The offline mode only works if assembled.molecules.only=TRUE, that is, if one tries to obtain the chrom info for the chromosomes only and not for all the sequences in the genome assembly (i.e. chromosomes + scaffolds).
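As a quick illustration, the call below should succeed without a network connection on a recent enough GenomeInfoDb (the offline mode was added in 1.33.9, as mentioned earlier in this thread). Treat the comment about the returned columns as an assumption and check ?getChromInfoFromUCSC for the authoritative description.

```r
library(GenomeInfoDb)  # offline mode needs GenomeInfoDb >= 1.33.9

# Only the assembled molecules (chromosomes) are available offline, and only
# for the registered genomes that support it.
chrom_info <- getChromInfoFromUCSC("hg38", assembled.molecules.only = TRUE)

# Assumption: the result is a data frame with a `chrom` column holding
# UCSC-style names such as "chr1".
head(chrom_info)
```

Dropping assembled.molecules.only = TRUE asks for the scaffolds as well, which still requires access to UCSC.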