awaardenberg commented 5 years ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Repository: https://github.com/awaardenberg/KinSwingR

Confirm the following by editing each check box to '[x]'

[x] I understand that by submitting my package to Bioconductor, the package source and all review commentary are visible to the general public.
[x] I have read the Bioconductor Package Submission instructions. My package is consistent with the Bioconductor Package Guidelines.
[x] I understand that a minimum requirement for package acceptance is to pass R CMD check and R CMD BiocCheck with no ERROR or WARNINGS. Passing these checks does not result in automatic acceptance. The package will then undergo a formal review and recommendations for acceptance regarding other Bioconductor standards will be addressed.
[x] My package addresses statistical or bioinformatic issues related to the analysis and comprehension of high throughput genomic data.
[x] I am committed to the long-term maintenance of my package. This includes monitoring the support site for issues that users may have, subscribing to the bioc-devel mailing list to stay aware of developments in the Bioconductor community, responding promptly to requests for updates from the Core team in response to changes in R or underlying software.

I am familiar with the essential aspects of Bioconductor software management, including:

[x] The 'devel' branch for new packages and features.
[x] The stable 'release' branch, made available every six months, for bug fixes.
[x] Bioconductor version control using Git (optionally via GitHub).

For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.

bioc-issue-bot commented 5 years ago

Hi @awaardenberg

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: KinSwingR
Type: Package
Title: KinSwingR: network-based kinase activity prediction
Version: 0.99.3
Authors@R: person("Ashley J.", "Waardenberg", email =
    "a.waardenberg@gmail.com", role=c("aut", "cre"))
Description: KinSwingR integrates phosphosite data derived from mass-spectrometry
    data and kinase-substrate predictions to predict kinase activity. Several functions
    allow the user to build PWM models of kinase-subtrates, statistically infer PWM:substrate
    matches, and integrate these data to infer kinase activity.
License: GPL-3
Encoding: UTF-8
LazyData: true
Depends:
    R (>= 3.5)
Imports:
    data.table,
    BiocParallel,
    sqldf
biocViews: Proteomics
RoxygenNote: 6.0.1
Suggests: knitr,
    rmarkdown
VignetteBuilder: knitr

bioc-issue-bot commented 5 years ago

A reviewer has been assigned to your package. Learn what to expect during the review process.

IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR, WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

208d7c3 Updated c2aca3c Remove prebuilt for BioC check c47b020 Update for Bioconductor 579a7e9 Merge branch 'master' of https://github.com/awaard...

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

5158f80 Addressing Bioconductor

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

1af5e3b Addresssing Bioconductor

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

awaardenberg commented 5 years ago

Hi Martin,

I've pushed a couple of changes to address the errors and warnings. However, I'm getting a Warning from the set.seed(seed) within two functions (you'll see in the build report). It's there to allow for reproducible calls and is documented in the arguments. There is a way to remove it, but I'll have to remove a function swing.master() that makes calling 3 functions easier. What's the best way to deal with this? Or will this be fine for going through the review stage?

Cheers, Ash

mtmorgan commented 5 years ago

If the user were wishing for repoducibility, couldn't they simply use set.seed in their own script?

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

30613ef Removing set.seed in functions Removing swing.mast...

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

awaardenberg commented 5 years ago

If the user were wishing for repoducibility, couldn't they simply use set.seed in their own script?

It's a fair point. I've removed the swing.master function, which is basically a wrapper script anyhow and moved the set.seed out of the functions and updated the documentation.

awaardenberg commented 5 years ago

HI @mtmorgan, let me know what else I have to do :) Ash

mtmorgan commented 5 years ago

DESCRIPTION

ok

NAMESPACE

avoid function (and variable, in the vignette) names separated with ., these are easily confused with S3 dispatch. Alternatives are cleanAnnotation() and clean_annotation().

vignettes

why is the main function, swing(), in an eval = FALSE block? It would seem that this is the most important function to have evaluated. Perhaps choose a still meaningful but computationally less demanding example?

R

use message() rather than print() to convey info, e.g., during verbose output. Note that message() is usually sufficient, and that message(paste(..., sep=""))` and similar are not needed.
stop(cat(paste())) is an unusual construct; probably just stop() or rarely stop(..., paste(), ...).
score_sequences.r:215 usually it is better to NOT specify BPPARAM, so that the user has control over this via BiocParallel::register(). Make sure your parallel computations work with register(SnowParam()).
it looks like seq.score() and swig.score() are the work-horse functions of the package, and that they can be 'vectorized' for considerable speed-up. Can you provide a simple example of what the inputs to this function look like, and I will try to provide some hints?

man

good

awaardenberg commented 5 years ago

Hi Martin,

Thanks for reviewing the package. It looks like they should be pretty straightforward to address (and I certainly appreciate the input/help - especially on the vectorisation suggestion). I've just got a couple of questions below that would be good to guide me to see these getting done efficiently. If you could just respond to these, I'll get another push into my gituhub to address the minor points and then come back with an example for the vectorisation. Please see my questions below:

NAMESPACE

avoid function (and variable, in the vignette) names separated with ., these are easily confused with S3 dispatch. Alternatives are cleanAnnotation() and clean_annotation().

In addition the function names and variable names used in the vignette, should I remove all "." from any variable or parameter name in all of the R code?

vignettes

why is the main function, swing(), in an eval = FALSE block? It would seem that this is the most important function to have evaluated. Perhaps choose a still meaningful but computationally less demanding example?

It's just time consuming for the vignette build, with n=1000 permutations as the default. The same functions with cut down examples are used in the examples for the functions. I could take a smaller random sample and drop the permutations right down to 10 or so. It is also requires the score.sequences() function to run, which is also computationally demanding as you point out in the vectorisation, but again, I can trim this right down for a run-time example in vignette if required.

mtmorgan commented 5 years ago

yes, I would avoid '.' everywhere.

My suspicion is that we'll address the performance issues elsewhere and that the vignette chunks will 'just work' as is; if not, I think a 'cooking show' model is appropriate, where you say and do the light-weight solution, then 'pull out of the oven' (i.e., a pre-computed, stored object) a more complete version.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

8b62524 Updating to Version 0.99.8 for next push - removed... e05c623 Replace cat() and print() with message() for verbo... 8d9b32f Updating function names, argument and variables. U...

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

awaardenberg commented 5 years ago

Hi Martin,

I've addressed everything now (I've gone for the cooking show model at the moment for the vignette which means all the vignette code is evaluated now) in the last build.

Just the vectorisation suggestion remains. For testing seqScore (which was seq.score), the following can be used as an input example and is probably more straightforward than the swingScore. I think the swingScore function is slowest at swing.r:183, which does merge of tables - see what you think.

sample data for testing:

sample.set <- seq(from=-5, to=5, by=.0001)

sample pwm for testing

pwm <- data.frame(cbind(sapply(seq_len(20), function(x) sample(sample.set, 20))), row.names = c("A","C","D","E","F","G","H","I","K","L","M","N","P", "Q","R","S","T","V","W","Y")) input_seq <- "PRRVRNLSAVLAART"

calling on the function:

seqScore(input_seq, pwm)

mtmorgan commented 5 years ago

I re-wrote seqScore as

seqScore <- function(input_seq, pwm) {
    ##vector of letters of sequence to score.
    input_seq <- unlist(strsplit(input_seq, ""))
    idx <- input_seq != "_"
    ridx <- match(input_seq[idx], rownames(pwm))
    cidx <- seq_along(input_seq)[idx]
    sum(pwm[cbind(ridx, cidx)])
}

It is vectorized (each function evaluated once, rather than length(input_seq) times) and so more efficient. The essential part is that one can subset a matrix by a matrix with two columns (row index, column index) and get the values of the matrix. I figure out the row index using efficient, vectorized match(). The column index is the index of the input sequence; I use seq_along() instead of the equivalent but more verbose seq_len(length()).

Probably similar approaches can be taken to other parts of the code?

At the next higher level, the sapply() part of the double iteration

  peptide_scores <-
    bplapply(seq_len(length(pwm_in[[2]]$kinase)), function(i)
      sapply(seq_len(nrow(input_data)), function(j)
        seqScore(as.character(input_data[j, 2]), pwm_in[[1]][[i]])))

can probably be simplified to be a call to a (further modified) seqScore() that does most of the operations in a vectorized fashion, with only the sum at the end with an iteration. If there's an easy way to get to this loop I can give it a look, too.

Is there an easy way to get to the slow part of swing.R:183?

awaardenberg commented 5 years ago

Thanks for looking at this Martin - very much appreciated. I certainly like the changes you have done and yes, there's probably a few places I can do the same now. I'll start fiddling around.

For getting to the peptide_scores loop, it's probably easiest to use the examples first (i.e. in the @example section of score_sequences.r) to generate pwms and annotated_data, then for inputs for peptide_scores use the following in the loop:

pwm_in <- pwms input_data <- annotated_data

For swing.R:183, it is merging the lists of data.frames (in data.table format - i.e. using keys for the latter merge) created after swing_permute, into a single data.frame. I couldn't really think of a quicker solution for this - so ideas are certainly welcome :) Maybe the swingScore function itself I can have along took at. Let me know if you need more for this one.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

24bf5ec vectorisation of seqScore; changing seq_len to seq... bfc0915 seq_len to seq_along 9ccb6c0 clean up 532d497 version bump

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

awaardenberg commented 5 years ago

Hi @mtmorgan, let me know if I can make anything easier. I've pushed the changed function fro the seqScore function now.

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

mtmorgan commented 5 years ago

I looked a little more at your code, and offer the following minor changes.

For the merge operation, maybe it is easier for swingScore to simply return a vector, and to use simplify2array() to coerce the list-of-vectors to a matrix / data.frame?

In the first few lines of swingScore(), the sapply() could be replaced by colSums(); it seems like in example(swing) there are NAs, and these propagate?

    score <- data_merge$Sipk * data_merge[,7:ncol(data_merge)]
    p_k <- colSums(score == 1, na.rm=TRUE)
    #count significant negative
    n_k <- colSums(score == -1, na.rm=TRUE)

Generally is is more efficient to update vectors rather than columns of a data frame, so delay creation of the data.frame until returning from the function. Since permutation is likely to be an expensive part of this computation, pay attention to calculating minimal necessary information for that case, and returning when that is accomplished.

This change

-        "kinase" = colnames(data_merge[7:ncol(data_merge)]),
+        "kinase" = colnames(data_merge)[7:ncol(data_merge)],

subsets a character vector rather than an entire data.frame, and will be more efficient.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

7a6f5e4 version bump a914d97 update swingScore update to documentation added re... 4827c1a documentation update ff18d0b vignette update

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

awaardenberg commented 5 years ago

Thank you again @mtmorgan. I've implemented all of these changes now (including the fix for the na propagation)

bioc-issue-bot commented 5 years ago

Your package has been accepted. It will be added to the Bioconductor Git repository and nightly builds. Additional information will be posed to this issue in the next several days.

Thank you for contributing to Bioconductor!

mtmorgan commented 5 years ago

The master branch of your GitHub repository has been added to Bioconductor's git repository.

To use the git.bioconductor.org repository, we need an 'ssh' key to associate with your github user name. If your GitHub account already has ssh public keys (https://github.com/awaardenberg.keys is not empty), then no further steps are required. Otherwise, do the following:

See further instructions at

https://bioconductor.org/developers/how-to/git/

for working with this repository. See especially

https://bioconductor.org/developers/how-to/git/new-package-workflow/ https://bioconductor.org/developers/how-to/git/sync-existing-repositories/

to keep your GitHub and Bioconductor repositories in sync.

Your package will be included in the next nigthly 'devel' build (check-out from git at about 6 pm Eastern; build completion around 2pm Eastern the next day) at

https://bioconductor.org/checkResults/

(Builds sometimes fail, so ensure that the date stamps on the main landing page are consistent with the addition of your package). Once the package builds successfully, you package will be available for download in the 'Devel' version of Bioconductor using biocLite(\"KinSwingR\"). The package 'landing page' will be created at

https://bioconductor.org/packages/KinSwingR

If you have any questions, please contact the bioc-devel mailing list (https://stat.ethz.ch/mailman/listinfo/bioc-devel); this issue will not be monitored further.

Bioconductor / Contributions

new package contribution #879

sample data for testing:

sample pwm for testing

calling on the function: