HCBravoLab / metagenomeFeatures

R package for annotating metagenomic datasets with taxonomic information
10 stars 5 forks source link

sreads name requirement - workflow example #1

Closed jnpaulson closed 9 years ago

jnpaulson commented 9 years ago

Not having a query.rdata file I generated my own and ran the pipeline only to get an error if I did not include names for my DNAStringSet object (whereas in ShortRead it appears to be replaced by ids). Is that expected that includes the ids and named DNAStringSet?

library(msd16s)
library(ShortRead)
n = 500
ord = order(rowSums(msd16s),decreasing=TRUE)[1:n]
centers = as.character(fData(msd16s[ord,])$clusterCenter)
query = DNAStringSet(centers)
otu_ids = rownames(fData(msd16s[ord,]))

# If I don't include a named DNAStringObject I get errors when running annotate:
# Error in data.frame(query_id = names(query_subset), Keys = db_subset,  : 
#  arguments imply differing number of rows: 0, 30
names(query) = otu_ids
id = BStringSet(otu_ids)
query = ShortRead(query,id)

tmpd = tempdir()
fl = paste(tmpd,"query.rdata",sep="")
save(query,file=fl)
rm(query)
load(fl)

query_subset <- sread(query)[1:30]
MgAnnoDF <- gg13.5MgDb$annotate(query = query_subset, mapping = "arbitrary")
MgAnnoDF
vjcitn commented 9 years ago

note: can't build vignette without query.rdata

nate-d-olson commented 9 years ago

You should be able to build the vignette with mgQuery.

For the unnamed DNAStringObjects error. The sequence names are used creating the metagenomeAnnotation object. Does it seem reasonable to add names to unnamed DNAStringObjects using the sequence index positions, or explicitly require named DNAStringObjects?

vjcitn commented 9 years ago

Thanks. First try to build with updated package led to

Quitting from lines 27-32 (Example_16S_Annotation_Workflow.Rmd) Error in data.frame(query_id = names(query_subset), Keys = db_subset, : object 'query_subset' not found

Tried to run on source after purl and that succeeded. Rendered the example Rmd again in the same session and

split_by(MgAnnoDF, "Phylumn") returned

list()

nate-d-olson commented 9 years ago

Typo in Phylum, it should work now.

jnpaulson commented 9 years ago

@vjcitn - at least on my machine things are working now. Is everything running smoothly for you as well?

vjcitn commented 9 years ago

i think they are close. recent attempt

Error : .onLoad failed in loadNamespace() for 'greengenes13.5MgDb', details:

call: sqliteSendQuery(con, statement, bind.data)

error: rsqlite_query_send: could not execute1: cannot rollback - no transaction is active

Error: loading failed

recover called non-interactively; frames dumped, use debugger() to view

%vjcair>

i think it is a memory limitation... but this should not be happening. you should not

be doing heavy computation in .onLoad IMHO. there's a database somewhere, check

for its existence, establish a connection, and go away, have functions that establish

any content-related entities like dplyr tables or whatever that the user must invoke

to move forward. i don't know why the onLoad is taking so long.

if the database doesn't exist, die with message on how to create it.

On Wed, Aug 5, 2015 at 11:46 AM, Joseph N. Paulson <notifications@github.com

wrote:

@vjcitn https://github.com/vjcitn - at least on my machine things are working now. Is everything running smoothly for you as well?

— Reply to this email directly or view it on GitHub https://github.com/HCBravoLab/metagenomeFeatures/issues/1#issuecomment-128045808 .

nate-d-olson commented 9 years ago

I moved the dowloading and building the database outside of onLoad with a function getGreenGenes13.5MgDb() to build the database. However the gg13.5MgDb object is no longer loaded into namespace. Do you have ayn recommendations for how to add gg13.5MgDb to the namespace so that is it available when the library is attached? The changes are in the onLoad_fix of greengenes13.5MgDb (https://github.com/HCBravoLab/greengenes13.5MgDb/tree/onLoad_fix).

vjcitn commented 9 years ago

Why not just put the large files that you need in (an) experimental data package(s), and depend on it (them)? This way you can have a vignette describing the content and any maintenance considerations in that experimental data package (one per microbiome database you want to deal with). This could be a placeholder approach until AnnotationHub is available to serve the file ... in this case, your onLoad could check for the material in the AnnotationHub cache and fail if it is not there, directing the user to use a given AnnotationHub command before trying again.

If these contents are too volatile to be packaged in this way it becomes perhaps more like a biomaRt situation. AnnotationHub might still be relevant but I don't know that it is designed to deal with very volatile content. In fact part of Bioconductor's core principle is that annotation should freeze for 6 months or so anyway to give analysts some stability.

In any event, I am no authority on how to deal with large annotation resources. Herve and others in the community have addressed it in different ways. But I do think having long silent onLoad events (core seems to want load to be silent, so you can't notify the user about what is going on) is to be avoided. Why not email biocore for more input?

On Thu, Aug 6, 2015 at 11:40 PM, nate-d-olson notifications@github.com wrote:

I moved the dowloading and building the database outside of onLoad with a function getGreenGenes13.5MgDb() to build the database. However the gg13.5MgDb object is no longer loaded into namespace. Do you have ayn recommendations for how to add gg13.5MgDb to the namespace so that is it available when the library is attached? The changes are in the onLoad_fix of greengenes13.5MgDb ( https://github.com/HCBravoLab/greengenes13.5MgDb/tree/onLoad_fix).

— Reply to this email directly or view it on GitHub https://github.com/HCBravoLab/metagenomeFeatures/issues/1#issuecomment-128581257 .

nate-d-olson commented 9 years ago

Just finised updating the greengenes13.5MgDb onLoad so that is no longer downloads the data. The data will be included in the package once it is in bioconductor but in the mean time, a script in inst/scripts can be used to download the data. Instruction for installing the package are provided on the github site. Let me know if you run into any problems. https://github.com/HCBravoLab/greengenes13.5MgDb