Closed awaardenberg closed 5 years ago
Hi @awaardenberg
Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.
The DESCRIPTION file for this package is:
Package: KinSwingR
Type: Package
Title: KinSwingR: network-based kinase activity prediction
Version: 0.99.3
Authors@R: person("Ashley J.", "Waardenberg", email =
"a.waardenberg@gmail.com", role=c("aut", "cre"))
Description: KinSwingR integrates phosphosite data derived from mass-spectrometry
data and kinase-substrate predictions to predict kinase activity. Several functions
allow the user to build PWM models of kinase-subtrates, statistically infer PWM:substrate
matches, and integrate these data to infer kinase activity.
License: GPL-3
Encoding: UTF-8
LazyData: true
Depends:
R (>= 3.5)
Imports:
data.table,
BiocParallel,
sqldf
biocViews: Proteomics
RoxygenNote: 6.0.1
Suggests: knitr,
rmarkdown
VignetteBuilder: knitr
A reviewer has been assigned to your package. Learn what to expect during the review process.
IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "ERROR, WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Received a valid push; starting a build. Commits are:
208d7c3 Updated c2aca3c Remove prebuilt for BioC check c47b020 Update for Bioconductor 579a7e9 Merge branch 'master' of https://github.com/awaard...
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Received a valid push; starting a build. Commits are:
5158f80 Addressing Bioconductor
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Received a valid push; starting a build. Commits are:
1af5e3b Addresssing Bioconductor
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Hi Martin,
I've pushed a couple of changes to address the errors and warnings. However, I'm getting a Warning from the set.seed(seed) within two functions (you'll see in the build report). It's there to allow for reproducible calls and is documented in the arguments. There is a way to remove it, but I'll have to remove a function swing.master() that makes calling 3 functions easier. What's the best way to deal with this? Or will this be fine for going through the review stage?
Cheers, Ash
If the user were wishing for repoducibility, couldn't they simply use set.seed in their own script?
Received a valid push; starting a build. Commits are:
30613ef Removing set.seed in functions Removing swing.mast...
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the build report for more details.
If the user were wishing for repoducibility, couldn't they simply use set.seed in their own script?
It's a fair point. I've removed the swing.master function, which is basically a wrapper script anyhow and moved the set.seed out of the functions and updated the documentation.
HI @mtmorgan, let me know what else I have to do :) Ash
DESCRIPTION
NAMESPACE
.
, these are easily confused with S3 dispatch. Alternatives are cleanAnnotation()
and clean_annotation()
.vignettes
swing()
, in an eval = FALSE
block? It would seem that this is the most important function to have evaluated. Perhaps choose a still meaningful but computationally less demanding example?R
use message()
rather than print()
to convey info, e.g., during verbose output. Note that message()
is usually sufficient, and that message(paste(..., sep=""))` and similar are not needed.
stop(cat(paste()))
is an unusual construct; probably just stop()
or rarely stop(..., paste(), ...)
.
score_sequences.r:215 usually it is better to NOT specify BPPARAM, so that the user has control over this via BiocParallel::register()
. Make sure your parallel computations work with register(SnowParam())
.
it looks like seq.score()
and swig.score()
are the work-horse functions of the package, and that they can be 'vectorized' for considerable speed-up. Can you provide a simple example of what the inputs to this function look like,
and I will try to provide some hints?
man
Hi Martin,
Thanks for reviewing the package. It looks like they should be pretty straightforward to address (and I certainly appreciate the input/help - especially on the vectorisation suggestion). I've just got a couple of questions below that would be good to guide me to see these getting done efficiently. If you could just respond to these, I'll get another push into my gituhub to address the minor points and then come back with an example for the vectorisation. Please see my questions below:
NAMESPACE
- avoid function (and variable, in the vignette) names separated with
.
, these are easily confused with S3 dispatch. Alternatives arecleanAnnotation()
andclean_annotation()
.
In addition the function names and variable names used in the vignette, should I remove all "." from any variable or parameter name in all of the R code?
vignettes
- why is the main function,
swing()
, in aneval = FALSE
block? It would seem that this is the most important function to have evaluated. Perhaps choose a still meaningful but computationally less demanding example?
It's just time consuming for the vignette build, with n=1000 permutations as the default. The same functions with cut down examples are used in the examples for the functions. I could take a smaller random sample and drop the permutations right down to 10 or so. It is also requires the score.sequences() function to run, which is also computationally demanding as you point out in the vectorisation, but again, I can trim this right down for a run-time example in vignette if required.
yes, I would avoid '.' everywhere.
My suspicion is that we'll address the performance issues elsewhere and that the vignette chunks will 'just work' as is; if not, I think a 'cooking show' model is appropriate, where you say and do the light-weight solution, then 'pull out of the oven' (i.e., a pre-computed, stored object) a more complete version.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the build report for more details.
Hi Martin,
I've addressed everything now (I've gone for the cooking show model at the moment for the vignette which means all the vignette code is evaluated now) in the last build.
Just the vectorisation suggestion remains. For testing seqScore (which was seq.score), the following can be used as an input example and is probably more straightforward than the swingScore. I think the swingScore function is slowest at swing.r:183, which does merge of tables - see what you think.
sample.set <- seq(from=-5, to=5, by=.0001)
pwm <- data.frame(cbind(sapply(seq_len(20), function(x) sample(sample.set, 20))), row.names = c("A","C","D","E","F","G","H","I","K","L","M","N","P", "Q","R","S","T","V","W","Y")) input_seq <- "PRRVRNLSAVLAART"
seqScore(input_seq, pwm)
I re-wrote seqScore as
seqScore <- function(input_seq, pwm) {
##vector of letters of sequence to score.
input_seq <- unlist(strsplit(input_seq, ""))
idx <- input_seq != "_"
ridx <- match(input_seq[idx], rownames(pwm))
cidx <- seq_along(input_seq)[idx]
sum(pwm[cbind(ridx, cidx)])
}
It is vectorized (each function evaluated once, rather than length(input_seq) times) and so more efficient. The essential part is that one can subset a matrix by a matrix with two columns (row index, column index) and get the values of the matrix. I figure out the row index using efficient, vectorized match()
. The column index is the index of the input sequence; I use seq_along()
instead of the equivalent but more verbose seq_len(length())
.
Probably similar approaches can be taken to other parts of the code?
At the next higher level, the sapply()
part of the double iteration
peptide_scores <-
bplapply(seq_len(length(pwm_in[[2]]$kinase)), function(i)
sapply(seq_len(nrow(input_data)), function(j)
seqScore(as.character(input_data[j, 2]), pwm_in[[1]][[i]])))
can probably be simplified to be a call to a (further modified) seqScore()
that does most of the operations in a vectorized fashion, with only the sum at the end with an iteration. If there's an easy way to get to this loop I can give it a look, too.
Is there an easy way to get to the slow part of swing.R:183?
Thanks for looking at this Martin - very much appreciated. I certainly like the changes you have done and yes, there's probably a few places I can do the same now. I'll start fiddling around.
For getting to the peptide_scores
loop, it's probably easiest to use the examples first (i.e. in the @example
section of score_sequences.r
) to generate pwms
and annotated_data
, then for inputs for peptide_scores
use the following in the loop:
pwm_in <- pwms
input_data <- annotated_data
For swing.R:183, it is merging the lists of data.frames (in data.table format - i.e. using keys for the latter merge) created after swing_permute
, into a single data.frame. I couldn't really think of a quicker solution for this - so ideas are certainly welcome :) Maybe the swingScore function itself I can have along took at. Let me know if you need more for this one.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the build report for more details.
Hi @mtmorgan, let me know if I can make anything easier. I've pushed the changed function fro the seqScore function now.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the build report for more details.
I looked a little more at your code, and offer the following minor changes.
For the merge
operation, maybe it is easier for swingScore to simply return a vector, and to use simplify2array()
to coerce the list-of-vectors to a matrix / data.frame?
In the first few lines of swingScore()
, the sapply()
could be replaced by colSums()
; it seems like in example(swing)
there are NAs, and these propagate?
score <- data_merge$Sipk * data_merge[,7:ncol(data_merge)]
p_k <- colSums(score == 1, na.rm=TRUE)
#count significant negative
n_k <- colSums(score == -1, na.rm=TRUE)
Generally is is more efficient to update vectors rather than columns of a data frame, so delay creation of the data.frame until returning from the function. Since permutation is likely to be an expensive part of this computation, pay attention to calculating minimal necessary information for that case, and returning when that is accomplished.
This change
- "kinase" = colnames(data_merge[7:ncol(data_merge)]),
+ "kinase" = colnames(data_merge)[7:ncol(data_merge)],
subsets a character vector rather than an entire data.frame, and will be more efficient.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the build report for more details.
Thank you again @mtmorgan. I've implemented all of these changes now (including the fix for the na propagation)
Your package has been accepted. It will be added to the Bioconductor Git repository and nightly builds. Additional information will be posed to this issue in the next several days.
Thank you for contributing to Bioconductor!
The master branch of your GitHub repository has been added to Bioconductor's git repository.
To use the git.bioconductor.org repository, we need an 'ssh' key to associate with your github user name. If your GitHub account already has ssh public keys (https://github.com/awaardenberg.keys is not empty), then no further steps are required. Otherwise, do the following:
See further instructions at
https://bioconductor.org/developers/how-to/git/
for working with this repository. See especially
https://bioconductor.org/developers/how-to/git/new-package-workflow/ https://bioconductor.org/developers/how-to/git/sync-existing-repositories/
to keep your GitHub and Bioconductor repositories in sync.
Your package will be included in the next nigthly 'devel' build (check-out from git at about 6 pm Eastern; build completion around 2pm Eastern the next day) at
https://bioconductor.org/checkResults/
(Builds sometimes fail, so ensure that the date stamps on the main landing page are consistent with the addition of your package). Once the package builds successfully, you package will be available for download in the 'Devel' version of Bioconductor using biocLite(\"KinSwingR\")
. The package 'landing page' will be created at
https://bioconductor.org/packages/KinSwingR
If you have any questions, please contact the bioc-devel mailing list (https://stat.ethz.ch/mailman/listinfo/bioc-devel); this issue will not be monitored further.
Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor
Confirm the following by editing each check box to '[x]'
[x] I understand that by submitting my package to Bioconductor, the package source and all review commentary are visible to the general public.
[x] I have read the Bioconductor Package Submission instructions. My package is consistent with the Bioconductor Package Guidelines.
[x] I understand that a minimum requirement for package acceptance is to pass R CMD check and R CMD BiocCheck with no ERROR or WARNINGS. Passing these checks does not result in automatic acceptance. The package will then undergo a formal review and recommendations for acceptance regarding other Bioconductor standards will be addressed.
[x] My package addresses statistical or bioinformatic issues related to the analysis and comprehension of high throughput genomic data.
[x] I am committed to the long-term maintenance of my package. This includes monitoring the support site for issues that users may have, subscribing to the bioc-devel mailing list to stay aware of developments in the Bioconductor community, responding promptly to requests for updates from the Core team in response to changes in R or underlying software.
I am familiar with the essential aspects of Bioconductor software management, including:
For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.