Bioconductor / Contributions

Contribute Packages to Bioconductor
135 stars 33 forks source link

GA4GHclient #312

Closed wdesouza closed 7 years ago

wdesouza commented 7 years ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Confirm the following by editing each check box to '[x]'

I am familiar with the essential aspects of Bioconductor software management, including:

For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.

bioc-issue-bot commented 7 years ago

Hi @Welliton309

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: GA4GHclient
Type: Package
Title: A Bioconductor package for accessing GA4GH API data servers
Version: 0.99.0
Authors@R: c(person("Welliton", "Souza", email = "well309@gmail.com",
    role = c("aut", "cre")),
    person("Benilton", "Carvalho", email = "benilton@unicamp.br", role = "ctb"),
    person("Cristiane", "Rocha", email = "crirocha@gmail.com", role = "ctb"))
Description: GA4GHclient provides an easy way to access public data servers
    through Global Alliance for Genomics and Health (GA4GH) Genomics API.
    This package also provides an interactive web application.
License: GPL (>= 2)
Depends:
    S4Vectors
Imports:
    AnnotationDbi,
    BiocGenerics,
    Biostrings,
    dplyr,
    DT,
    GenomeInfoDb,
    GenomicFeatures,
    GenomicRanges,
    httr,
    IRanges,
    jsonlite,
    methods,
    purrr,
    shiny,
    shinydashboard,
    shinyjs,
    tidyr,
    utils,
    VariantAnnotation
Suggests:
    BiocStyle,
    org.Hs.eg.db,
    knitr,
    rmarkdown,
    testthat,
    TxDb.Hsapiens.UCSC.hg19.knownGene
LazyData: TRUE
RoxygenNote: 6.0.1
VignetteBuilder: knitr
URL: https://github.com/labbcb/GA4GHclient
BugReports: https://github.com/labbcb/GA4GHclient/issues
biocViews: DataRepresentation, GUI, ThirdPartyClient
lawremi commented 7 years ago

Why bundle the Shiny app with the package? The API is obviously independent of Shiny, and there could be multiple GUIs built on top of it. Maybe split into a separate package?

The app() function could accept an OrganismDb object, so no need for separate OrgDb and TxDb arguments in common cases.

The pageSize argument is a bit awkward. What about returning a deferred request object with methods on generics like head() that restrict the page counts. Coercion to data.frame or something else would perform the request. Just seems more idiomatic.

Why does the user need to explicitly coerce the result of searchVariants() to VCF? Maybe the vignette could explain how the variant.calls object is useful prior to coercion.

Could searchVariantsByGRanges() be merged into searchVariants()? Could be one range argument that is coerced to GRanges, instead of several arguments that are effectively shared with the GRanges() constructor. Then you could do e.g. searchVariants(host, "1:1-1000").

wdesouza commented 7 years ago

The Shiny app is shipped along with the package to facilitate development of the interface and to provide a quick way to interact with the server. I can put the app in a separate package if necessary.

Can I extract all data of TxDB package from an OrganismDb? The packages uses these functions on TxDB package: genes, transcripts, exons and promoters.

I understand that pageSize seems to be awkward but is part of the GA4GH API schemas (Reference).

All search functions return data.frame inclusive searchVariants. To keep consistency I don't change the data structure of the response data. I added converter for VariantAnnotation classes as a convenient way to get variant data from GA4GH-based server and export to VCF format file.

I kept the referenceGenome, start and end arguments for searchVariants to reflect the GA4GH API schema (reference). I added searchVariantsByGRanges function as a convenient way to use existing GenomicRanges objects as search parameters.

bioc-issue-bot commented 7 years ago

Your package has been approved for building. Your package is now submitted to our queue.

IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/GA4GHclient_buildreport_20170309165956.html

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

f203eb1 Bump version for webhook

lawremi commented 7 years ago

If the goal is to conform to GA4GH schemas then I guess there is room for a more idiomatic interface built on top of this.

mtmorgan commented 7 years ago

Please do not be concerned about the Rsamtools error on windows; this seems to be a problem with the single package builder.

mtmorgan commented 7 years ago

@lawremi can you expand on your comments, please?

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/GA4GHclient_buildreport_20170309171442.html

lawremi commented 7 years ago

By idiomatic, I mostly meant using the appropriate R data structure for the type of data. And using conventional means of restricting data instead of arguments that refer to web pages. And other stuff like that. But if the scope of this package is restricted to being a low-level API, then the current design seems adequate. It does make it strange to have an attached Shiny app though. Just some high-level comments without looking at the code.

wdesouza commented 7 years ago

@lawremi I would like the package to come closer to the R/Bioconductor than to the API. I am trying to this by using the base Bioconductor classes instead of base R types. The solution I see for removing the pageSize argument is iterating until get all response data from server. However the computer memory will should keep large amount of data. The pageSize method works like a chunk processing.

lawremi commented 7 years ago

I could see a couple options:

1) A functional iteration interface like GenomicFiles::reduceByYield(). Another useful interface would be like base::Find() where a function returns a logical vector and iteration stops once there is a TRUE in the return value, and returns the corresponding record. 2) Deferred execution of the request, so that calling the search function immediately returns an object representing the intent to perform the request. Then, the user could call head(result, 10), and the method would execute the query with pageSize=10. The show() method would only query the first 10, for example. Splitting by a Partitioning would create deferred chunks, each corresponding to a page range. Then, Find() would just work. Coercion to a DataFrame would perform the complete iteration. 3) Just iterate, assuming that the range restriction is sufficient.

wdesouza commented 7 years ago

@lawremi I forgot to mention the full parameter. When this parameter is TRUE the search functions will iterate over the request until get all response data. I think this is the expected behavior of your point 3. Do you think it would be more advisable to set this parameter to TRUE by default?

lawremi commented 7 years ago

Yes, I think that's going to lead to the least surprise. When things are too big, the user can figure out how to limit it.

lawremi commented 7 years ago

One suggestion: maybe make pageSize default to NA_integer_, and whenever it is the missing value, behave like full=TRUE. Then you could just drop the full argument. Maybe give pageSize a more intuitive name, like nmax or nrows like read.table().

wdesouza commented 7 years ago

When full = TRUE the pageSize works more like n parameter of ShortRead::FastqStreamer function. It defines how many entries the server should return in each request (the server will not respect when the value is bigger than its maximum response length). It will need to have two parameters: 1) define the chunk size (useful when it is expected large response data because large value will reduce the number of requisitions), 2) define how many entries the response data should have (number of rows of the data frame). I like this solution because it does not need the full and pageToken parameters and it simplifies the response data object (can be the S4Vectors::DataFrame instead of a derived class). What do you think @lawremi? Your feedback has been very helpful.

lawremi commented 7 years ago

Makes sense to me.

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

855c289 Moved Shiny app to its own package

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/GA4GHclient_buildreport_20170315141837.html

wdesouza commented 7 years ago

I did the modificafions suggested by @lawremi. Now search functions will retrieve all available entries unless the nrows parameter is defined. The responseSize specifies how many entries to get on each iteration. All get and search functions return DataFrame object. The searchVariants function also returns a data frame because the result contains the id of variants which is required to use getVariant function.

Shiny app was moved to its own package: https://github.com/labbcb/GA4GHshiny

nturaga commented 7 years ago

Build

March 11th

R-dev CMD build GA4GHclient: Ok

R-dev CMD INSTALL GA4GHclient: Ok

March 16th after ( 102 files changed, 930 insertions(+), 1564 deletions(-))

R-dev CMD build GA4GHclient: Ok

R-dev CMD INSTALL GA4GHclient: Ok

DESCRIPTION

lawremi commented 7 years ago

You could also make as.VCF() a method on asVCF() (that generic exists because it allows for additional arguments). I looked at as.VCF(). Are you sure you want all info columns to be CharacterList? Use lengths() instead of that sapply() call.

wdesouza commented 7 years ago

@lawremi The searchVariants function returns an DataFrame (S4Vectors packages). Is it a good idea to implement asVCF for x as DataFrame class? Or is it better to extend the DataFrame class and then implement asVCF for x as derived class? What is the best way to implement the coerce method?

I am using CharacterList at this moment because the server always return string type (reference). It was already fixed but I could not reproduce this feature. As soon I reproduce this, I will update the package.

lawremi commented 7 years ago

They might all be strings but are they all multivalued also? Why wouldn't searchVariants() just return a VCF directly? Or maybe a wrapper for doing that? An API should try hard not to discard semantics. Here, we are querying a data source of variants, dropping it to a DataFrame, and promoting back to a variant data structure. Would be nice to hide the DataFrame representation. If a low-level API has to expose it, make an explicit coercion function like makeVCFFromGA4GHResponse(). The user would rarely call that though, thanks to the wrapper.

wdesouza commented 7 years ago

Returning VCF class is a good idea because the data came from a VCF file. However there are columns that I didn't find a proper way to store in an VCF object. Maybe in metadata? These columns are id and variantSetId. These two columns are very important to retrieve new data from the server, for example: 1) search a list of variants (no variant call data returned), 2) get only one variant by its id(all variant call data returned), 3) do some stats.

Maybe the parameter asVCF = TRUE for searchVariants function. If true it will return VCF object otherwise a DataFrame. Is it a good solution?

lawremi commented 7 years ago

The VCF spec already has an "ID" column, and it's supported by the VCF object. For "variantSetId", you might put it into the info component. An Rle would be an efficient way to stored it. That makes it easy to combine variant sets into a single object.

wdesouza commented 7 years ago

Thank you @lawremi I will do that.

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

7f776ca Added variant and variant set IDs into VF objects b74bdcc Added comments to utility functions d8efdc6 Bump package version

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/GA4GHclient_buildreport_20170317133123.html

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

0a55730 Fix test unit of makeVCFFromGA4GHResponse

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, TIMEOUT". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/GA4GHclient_buildreport_20170317134457.html

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

0981a07 Added missing request functions; minor changes in ...

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/GA4GHclient_buildreport_20170318174013.html

wdesouza commented 7 years ago

The searchVariants, getVariant and getVariantSet return objects from VariantAnnotation classes by default. They also return DataFrame (when asVCF = FALSE) It is useful for working with web interface.

@nturaga I added the comments and the missing request functions. The GA4GH schema is still in development. I am following and giving feedback for their work (https://github.com/ga4gh/ga4gh-schemas).

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

7f6056c Improved description text

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/GA4GHclient_buildreport_20170324121701.html

mtmorgan commented 7 years ago

Looks like the major issues in the review have been addressed. I'll mark this as 'accepted' and add it to Bioconductor svn and nightly builds. More details in a few days via email.