Bioconductor / AnnotationForge

Tools for building SQLite-based annotation data packages
https://bioconductor.org/packages/AnnotationForge

Repeat building of makeOrgDbFromNCBI is significantly slower #52

Closed lshep closed 6 months ago

lshep commented 1 year ago

By "significantly slower" I mean the following. The cache used to be rebuilt once a day, which took a few hours, but we had optimized things so that when building several OrgDbs in a row, subsequent calls took a few seconds (minutes at most). Now each call takes hours again. I am still trying to build the 3.17 non-standard OrgDbs to put into AnnotationHub, which currently requires building ~1900 of them. This used to take me 3 days; it has now been going on for 6 weeks or more! My suspicion is that it has to do with this commit https://github.com/Bioconductor/AnnotationForge/commit/27b4772bb164ed40269b9a770e4bfdd722fdbbc8 but if I revert it, then in local testing I see the previously reported

 ERROR [2023-06-21 10:25:56] error processing DATA: error in evaluating the argument 'table' in selecting a method for function '%in%': A libcurl function was given a bad argument

which also never used to occur.
Any advice is appreciated @jmacdon

jmacdon commented 1 year ago

@lshep I'll take a look.

lshep commented 1 year ago

@jmacdon Thanks. Worth noting that I could be wrong that it is related to that commit. My thinking is that writing each time might be the bottleneck: I believe the original idea was that if a build used the same data, it would not need to write anything new and would just use the existing data, writing only when the data was being updated. But again, it could be somewhere else too.

jmacdon commented 1 year ago

@lshep What is the exact call you are using to build the OrgDb?

lshep commented 1 year ago

I'm using the recipe in AnnotationHub that calls this function underneath:

meta <- updateResources("NonStandardOrgDb", BiocVersion = "3.17", preparerClasses = "NCBIImportPreparer", metadataOnly = FALSE, insert = FALSE, justRunUnitTest = FALSE)

jmacdon commented 1 year ago

I backed out the changes to .downloadData, so using rebuildCache = FALSE will now use the existing NCBI.sqlite database to build the OrgDb instead of rebuilding the SQLite database first.
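
For readers following the thread, the reuse-versus-rebuild decision discussed above can be sketched roughly as below. This is only an illustration in base R, not AnnotationForge's actual .downloadData code: the helper name needs_rebuild and the max_age_days parameter are hypothetical, and the one-day threshold is an assumption taken from the "once a day" rebuild mentioned earlier in the thread.

```r
# Hypothetical helper: decide whether the cached NCBI SQLite file must be rebuilt.
# Illustration only; this is NOT AnnotationForge's internal .downloadData().
needs_rebuild <- function(db_path, rebuild_cache = TRUE, max_age_days = 1) {
  # No cache on disk: there is nothing to reuse, so a build is unavoidable.
  if (!file.exists(db_path)) {
    return(TRUE)
  }
  # Caller opted out of rebuilding: reuse the existing file whatever its age.
  if (!rebuild_cache) {
    return(FALSE)
  }
  # Otherwise rebuild only when the cache is older than the allowed age
  # (assumed here to be one day).
  age_days <- as.numeric(
    difftime(Sys.time(), file.mtime(db_path), units = "days")
  )
  age_days > max_age_days
}

# Usage: check a freshly created placeholder cache file.
cache <- file.path(tempdir(), "NCBI.sqlite")
file.create(cache)
needs_rebuild(cache, rebuild_cache = FALSE)
```

With a check like this in front of the expensive step, a loop over ~1900 organisms would pay the multi-hour rebuild cost at most once and reuse the existing file for every later call.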