Bioconductor / AnnotationForge

Tools for building SQLite-based annotation data packages
https://bioconductor.org/packages/AnnotationForge

makeOrgPackageFromNCBI never finishes #17

Closed npokorzynski closed 2 years ago

npokorzynski commented 3 years ago

Hi,

This is my first time using AnnotationForge so it's entirely possible that I'm doing something wrong here, but I'm attempting to make an OrgDb object from NCBI with the following code and the output directed to a specific working directory I'm using:

    library(AnnotationForge)
    makeOrgPackageFromNCBI(version = "0.1",
                           author = "Nick D. Pokorzynski <nick.pokorzynski@unmc.edu>",
                           maintainer = "Nick D. Pokorzynski <nick.pokorzynski@unmc.ed>",
                           outputDir = ".",
                           tax_id = "471472",
                           genus = "Chlamydia",
                           species = "trachomatis L2 434/Bu")

When I run this, the program continues to stall at the step of "processing GO data." It never seems to error out or fail, but just never completes. The attempt I'm running currently has been going overnight and has yet to complete or fail. That being the case, I'm at a real loss for what I ought to try to remedy this - any help is appreciated!

Thanks, Nick

lshep commented 2 years ago

Sorry for the delayed response. There are a few different things that could be going on. First, the function does take a few hours to run, if not longer. When I ran it this morning, the GO data step took roughly 2.5 hours on its own.

With newer versions of R, it could also fail due to the default timeout limit being decreased. You could try options(timeout=10000) to increase the timeout limit for downloads.
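A minimal sketch of the timeout change suggested above, using base R's `options()` and `getOption()`. The value 10000 (seconds) is the one proposed in this comment; the variable name `old_opts` is only illustrative.

```r
# Raise the download timeout (in seconds) for the current R session.
# options() returns the previous settings invisibly, so they can be restored later.
old_opts <- options(timeout = 10000)

# Confirm the new value is in effect before starting the long-running build.
getOption("timeout")
```

The setting only lasts for the current session, so it has to be applied before calling makeOrgPackageFromNCBI(). Running `options(old_opts)` afterwards restores the previous value.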

The disk space requirement is also significantly larger than previously reported (12 GB was reported earlier, but it can reach 62 GB depending on the flags used to run the function), so it's possible you ran out of space.

If this is still an issue, please respond with the output of sessionInfo() and we will reopen the issue for further investigation.

Although this specific species is not found in AnnotationHub, it is worth noting that at release time the core team provides ~1500 OrgDb databases through the AnnotationHub interface, so it might not be necessary to create your own.
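For readers who want to check whether a prebuilt OrgDb already exists, here is a hedged sketch using the AnnotationHub package's `AnnotationHub()` constructor and `query()` function. The helper name `find_orgdb` is hypothetical, and the sketch assumes AnnotationHub is installed and a network connection is available.

```r
# Hypothetical helper: search AnnotationHub for OrgDb records matching `terms`.
# Assumes the AnnotationHub package is installed and the network is reachable;
# the first call downloads and caches the hub metadata.
find_orgdb <- function(terms) {
  if (!requireNamespace("AnnotationHub", quietly = TRUE)) {
    stop("AnnotationHub is not installed; try BiocManager::install('AnnotationHub')")
  }
  hub <- AnnotationHub::AnnotationHub()
  # query() keeps the records whose metadata match every search term
  AnnotationHub::query(hub, c("OrgDb", terms))
}

# Example (not run here because it needs network access):
# hits <- find_orgdb("Chlamydia trachomatis")
# length(hits)        # 0 means no prebuilt OrgDb matched
# orgdb <- hits[[1]]  # download the first match, if there is one
```

As noted above, this particular species is not expected to turn up, so an empty result here is consistent with the comment.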

npokorzynski commented 2 years ago

> While this specific species is not found in the AnnotationHub -- it is also worth noting that at release time, the core team will provide ~1500 orgdb databases through the AnnotationHub interface and might not be necessary to create your own.

Can you add some context to this for me? I didn't know there was an upcoming release - when is it scheduled for? And is there any info on what species will be included in the release?

lshep commented 2 years ago

Bioconductor has two releases a year, normally one in the spring (April/May) and one in the fall (Oct/Nov). We have not set a tentative release date yet, but it is looking like the end of April. Any deadlines and the official date will be posted at https://bioconductor.org/developers/release-schedule/

The set of species generated is somewhat random. We take the top 1000 from NCBI that are listed in the generated data object provided in AnnotationForge, which can be located with system.file('extdata','viableIDs.rda', package='AnnotationForge'). I'll have to look into the generation script for that object to see exactly how the 1000 are determined each release.
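A rough sketch of how that data object could be inspected, using only base R (`system.file()`, `load()`, `mget()`). The helper `load_rda_objects` is hypothetical, and the layout of the objects inside viableIDs.rda is not described in this thread, so the final membership check is an assumption about its structure.

```r
# Hypothetical helper: load every object stored in an .rda file into a private
# environment and return them as a named list, leaving the workspace untouched.
load_rda_objects <- function(path) {
  stopifnot(file.exists(path))
  env <- new.env()
  object_names <- load(path, envir = env)  # load() returns the object names
  mget(object_names, envir = env)
}

# Locate the file shipped with AnnotationForge; system.file() returns ""
# when the package or the file is not installed.
ids_path <- system.file("extdata", "viableIDs.rda", package = "AnnotationForge")
if (nzchar(ids_path)) {
  viable <- load_rda_objects(ids_path)
  str(viable, max.level = 1)  # inspect what the file actually contains
  # Assumption: the taxonomy IDs can be flattened to a plain vector.
  print("471472" %in% as.character(unlist(viable)))
}
```

Loading into a separate environment avoids overwriting any existing variable that happens to share a name with an object stored in the file.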