Closed wdesouza closed 7 years ago
Hi @Welliton309
Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.
The DESCRIPTION file for this package is:
Package: GA4GHclient
Type: Package
Title: A Bioconductor package for accessing GA4GH API data servers
Version: 0.99.0
Authors@R: c(person("Welliton", "Souza", email = "well309@gmail.com",
role = c("aut", "cre")),
person("Benilton", "Carvalho", email = "benilton@unicamp.br", role = "ctb"),
person("Cristiane", "Rocha", email = "crirocha@gmail.com", role = "ctb"))
Description: GA4GHclient provides an easy way to access public data servers
through Global Alliance for Genomics and Health (GA4GH) Genomics API.
This package also provides an interactive web application.
License: GPL (>= 2)
Depends:
S4Vectors
Imports:
AnnotationDbi,
BiocGenerics,
Biostrings,
dplyr,
DT,
GenomeInfoDb,
GenomicFeatures,
GenomicRanges,
httr,
IRanges,
jsonlite,
methods,
purrr,
shiny,
shinydashboard,
shinyjs,
tidyr,
utils,
VariantAnnotation
Suggests:
BiocStyle,
org.Hs.eg.db,
knitr,
rmarkdown,
testthat,
TxDb.Hsapiens.UCSC.hg19.knownGene
LazyData: TRUE
RoxygenNote: 6.0.1
VignetteBuilder: knitr
URL: https://github.com/labbcb/GA4GHclient
BugReports: https://github.com/labbcb/GA4GHclient/issues
biocViews: DataRepresentation, GUI, ThirdPartyClient
Why bundle the Shiny app with the package? The API is obviously independent of Shiny, and there could be multiple GUIs built on top of it. Maybe split into a separate package?
The app()
function could accept an OrganismDb object, so no need for separate OrgDb and TxDb arguments in common cases.
The pageSize
argument is a bit awkward. What about returning a deferred request object with methods on generics like head()
that restrict the page counts. Coercion to data.frame
or something else would perform the request. Just seems more idiomatic.
Why does the user need to explicitly coerce the result of searchVariants()
to VCF? Maybe the vignette could explain how the variant.calls
object is useful prior to coercion.
Could searchVariantsByGRanges()
be merged into searchVariants()
? Could be one range argument that is coerced to GRanges, instead of several arguments that are effectively shared with the GRanges()
constructor. Then you could do e.g. searchVariants(host, "1:1-1000")
.
The Shiny app is shipped along with the package to facilitate development of the interface and to provide a quick way to interact with the server. I can put the app in a separate package if necessary.
Can I extract all data of TxDB package from an OrganismDb? The packages uses these functions on TxDB package: genes
, transcripts
, exons
and promoters
.
I understand that pageSize
seems to be awkward but is part of the GA4GH API schemas (Reference).
All search
functions return data.frame
inclusive searchVariants
. To keep consistency I don't change the data structure of the response data. I added converter for VariantAnnotation classes as a convenient way to get variant data from GA4GH-based server and export to VCF format file.
I kept the referenceGenome
, start
and end
arguments for searchVariants
to reflect the GA4GH API schema (reference). I added searchVariantsByGRanges
function as a convenient way to use existing GenomicRanges
objects as search parameters.
Your package has been approved for building. Your package is now submitted to our queue.
IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/GA4GHclient_buildreport_20170309165956.html
Received a valid push; starting a build. Commits are:
f203eb1 Bump version for webhook
If the goal is to conform to GA4GH schemas then I guess there is room for a more idiomatic interface built on top of this.
Please do not be concerned about the Rsamtools error on windows; this seems to be a problem with the single package builder.
@lawremi can you expand on your comments, please?
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/GA4GHclient_buildreport_20170309171442.html
By idiomatic, I mostly meant using the appropriate R data structure for the type of data. And using conventional means of restricting data instead of arguments that refer to web pages. And other stuff like that. But if the scope of this package is restricted to being a low-level API, then the current design seems adequate. It does make it strange to have an attached Shiny app though. Just some high-level comments without looking at the code.
@lawremi I would like the package to come closer to the R/Bioconductor than to the API. I am trying to this by using the base Bioconductor classes instead of base R types. The solution I see for removing the pageSize
argument is iterating until get all response data from server. However the computer memory will should keep large amount of data. The pageSize
method works like a chunk processing.
I could see a couple options:
1) A functional iteration interface like GenomicFiles::reduceByYield()
. Another useful interface would be like base::Find()
where a function returns a logical vector and iteration stops once there is a TRUE
in the return value, and returns the corresponding record.
2) Deferred execution of the request, so that calling the search function immediately returns an object representing the intent to perform the request. Then, the user could call head(result, 10)
, and the method would execute the query with pageSize=10
. The show()
method would only query the first 10, for example. Splitting by a Partitioning would create deferred chunks, each corresponding to a page range. Then, Find()
would just work. Coercion to a DataFrame would perform the complete iteration.
3) Just iterate, assuming that the range restriction is sufficient.
@lawremi I forgot to mention the full
parameter. When this parameter is TRUE
the search
functions will iterate over the request until get all response data. I think this is the expected behavior of your point 3. Do you think it would be more advisable to set this parameter to TRUE by default?
Yes, I think that's going to lead to the least surprise. When things are too big, the user can figure out how to limit it.
One suggestion: maybe make pageSize
default to NA_integer_
, and whenever it is the missing value, behave like full=TRUE
. Then you could just drop the full
argument. Maybe give pageSize
a more intuitive name, like nmax
or nrows
like read.table()
.
When full = TRUE
the pageSize
works more like n
parameter of ShortRead::FastqStreamer
function. It defines how many entries the server should return in each request (the server will not respect when the value is bigger than its maximum response length). It will need to have two parameters: 1) define the chunk size (useful when it is expected large response data because large value will reduce the number of requisitions), 2) define how many entries the response data should have (number of rows of the data frame). I like this solution because it does not need the full
and pageToken
parameters and it simplifies the response data object (can be the S4Vectors::DataFrame
instead of a derived class). What do you think @lawremi? Your feedback has been very helpful.
Makes sense to me.
Received a valid push; starting a build. Commits are:
855c289 Moved Shiny app to its own package
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/GA4GHclient_buildreport_20170315141837.html
I did the modificafions suggested by @lawremi. Now search functions will retrieve all available entries unless the nrows
parameter is defined. The responseSize
specifies how many entries to get on each iteration. All get and search functions return DataFrame
object. The searchVariants
function also returns a data frame because the result contains the id of variants which is required to use getVariant
function.
Shiny app was moved to its own package: https://github.com/labbcb/GA4GHshiny
R-dev CMD build GA4GHclient: Ok
R-dev CMD INSTALL GA4GHclient: Ok
R-dev CMD build GA4GHclient: Ok
R-dev CMD INSTALL GA4GHclient: Ok
Maybe mention in the Description
that it is a low-level access for the GA4GH API.
ok.
Impressive vignette.
Nice man pages.
Follows Bioconductor how-to web query resource. Good job!
Comments missing for some functions eg: AllUtilities.R, eg: Line 72, and Line 81. High level description of what the function is supposed to do, gives the package reviewer an easy way to evaluate.
Good coding style.
Implement as.VCF()
as a coerece method so that the user can say as(variants, "VCF")
. Likewise for as.VCFHeader()
How do you plan to make sure that your package adheres to the GA4GH API?
What are the challenges you are facing to implement the missing functions you mentioned in the vignette? When do you plan to address them?
You could also make as.VCF()
a method on asVCF()
(that generic exists because it allows for additional arguments). I looked at as.VCF()
. Are you sure you want all info columns to be CharacterList? Use lengths()
instead of that sapply()
call.
@lawremi The searchVariants
function returns an DataFrame
(S4Vectors packages). Is it a good idea to implement asVCF
for x
as DataFrame
class? Or is it better to extend the DataFrame
class and then implement asVCF
for x
as derived class? What is the best way to implement the coerce method?
I am using CharacterList
at this moment because the server always return string type (reference). It was already fixed but I could not reproduce this feature. As soon I reproduce this, I will update the package.
They might all be strings but are they all multivalued also? Why wouldn't searchVariants()
just return a VCF directly? Or maybe a wrapper for doing that? An API should try hard not to discard semantics. Here, we are querying a data source of variants, dropping it to a DataFrame, and promoting back to a variant data structure. Would be nice to hide the DataFrame representation. If a low-level API has to expose it, make an explicit coercion function like makeVCFFromGA4GHResponse()
. The user would rarely call that though, thanks to the wrapper.
Returning VCF
class is a good idea because the data came from a VCF file. However there are columns that I didn't find a proper way to store in an VCF object. Maybe in metadata
? These columns are id
and variantSetId
. These two columns are very important to retrieve new data from the server, for example: 1) search a list of variants (no variant call data returned), 2) get only one variant by its id
(all variant call data returned), 3) do some stats.
Maybe the parameter asVCF = TRUE
for searchVariants
function. If true it will return VCF object otherwise a DataFrame. Is it a good solution?
The VCF spec already has an "ID" column, and it's supported by the VCF object. For "variantSetId", you might put it into the info component. An Rle would be an efficient way to stored it. That makes it easy to combine variant sets into a single object.
Thank you @lawremi I will do that.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/GA4GHclient_buildreport_20170317133123.html
Received a valid push; starting a build. Commits are:
0a55730 Fix test unit of makeVCFFromGA4GHResponse
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "skipped, TIMEOUT". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/GA4GHclient_buildreport_20170317134457.html
Received a valid push; starting a build. Commits are:
0981a07 Added missing request functions; minor changes in ...
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/GA4GHclient_buildreport_20170318174013.html
The searchVariants
, getVariant
and getVariantSet
return objects from VariantAnnotation
classes by default. They also return DataFrame
(when asVCF = FALSE
) It is useful for working with web interface.
@nturaga I added the comments and the missing request functions. The GA4GH schema is still in development. I am following and giving feedback for their work (https://github.com/ga4gh/ga4gh-schemas).
Received a valid push; starting a build. Commits are:
7f6056c Improved description text
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the following build report for more details:
http://bioconductor.org/spb_reports/GA4GHclient_buildreport_20170324121701.html
Looks like the major issues in the review have been addressed. I'll mark this as 'accepted' and add it to Bioconductor svn and nightly builds. More details in a few days via email.
Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor
Confirm the following by editing each check box to '[x]'
[x] I understand that by submitting my package to Bioconductor, the package source and all review commentary are visible to the general public.
[x] I have read the Bioconductor Package Submission instructions. My package is consistent with the Bioconductor Package Guidelines.
[x] I understand that a minimum requirement for package acceptance is to pass R CMD check and R CMD BiocCheck with no ERROR or WARNINGS. Passing these checks does not result in automatic acceptance. The package will then undergo a formal review and recommendations for acceptance regarding other Bioconductor standards will be addressed.
[x] My package addresses statistical or bioinformatic issues related to the analysis and comprehension of high throughput genomic data.
[x] I am committed to the long-term maintenance of my package. This includes monitoring the support site for issues that users may have, subscribing to the bioc-devel mailing list to stay aware of developments in the Bioconductor community, responding promptly to requests for updates from the Core team in response to changes in R or underlying software.
I am familiar with the essential aspects of Bioconductor software management, including:
For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.