amplican - Githubissues

JokingHero commented 8 years ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Repository: https://github.com/valenlab/amplican

Confirm the following by editing each check box to '[x]'

[ x ] I understand that by submitting my package to Bioconductor, the package source and all review commentary are visible to the general public.
[ x ] I have read the Bioconductor Package Submission instructions. My package is consistent with the Bioconductor Package Guidelines.
[ x ] My package addresses statistical or bioinformatic issues related to the analysis and comprehension of high throughput genomic data.
[ x ] I am committed to the long-term maintenance of my package. This includes monitoring the support site for issues that users may have, subscribing to the bioc-devel mailing list to stay aware of developments in the Bioconductor community, responding promptly to requests for updates from the Core team in response to changes in R or underlying software.

I am familiar with the essential aspects of Bioconductor software management, including:

[ x ] The 'devel' branch for new packages and features.
[ x ] The stable 'release' branch, made available every six months, for bug fixes.
[ x ] Bioconductor version control using Subversion (optionally via GitHub).

For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.

bioc-issue-bot commented 8 years ago

Hi @JokingHero

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: amplican
Type: Package
Title: fast and precise analysis of CRISPR experiments
Description: `amplican` creates reports of deletions, insertions, frameshifts,
    cut rates and other metrics in user selected format (preffered html). `amplican`
    uses vary fast C implementation of Gotoh alhoritm to align your fastq samples
    and automates analysis across different experiments. `amplican` maintains
    elasticity through configuration file, which with your fastq samples are only
    requirements.
Version: 0.99.0
Authors@R: c(
    person("Kornel", "Labun", email = "kornel.labun@gmail.com", role = "aut"),
    person(c("Rafael", "Nozal"), "Canyadas", email = "rafanozal@gmail.com",  role = "ctr"),
    person("Eivind", "Valen", email = "eivind.valen@gmail.com", role = c("cph", "cre"))
  )
URL: https://github.com/valenlab/amplican
BugReports: https://github.com/valenlab/amplican/issues
biocViews: Technology, qPCR, CRISPR
License: GPL-3
LazyData: TRUE
LinkingTo: Rcpp
Depends: R (>= 3.3.0)
Imports:
    Rcpp,
    utils,
    R.utils,
    seqinr,
    ShortRead,
    IRanges,
    GenomicRanges,
    S4Vectors,
    doParallel,
    foreach,
    ggplot2,
    ggbio,
    stringr,
    stats,
    rmarkdown,
    knitr,
    methods
RoxygenNote: 5.0.1
Suggests:
    testthat,
    BiocStyle
Collate:
    'RcppExports.R'
    'amplican.R'
    'helpers_warnings.R'
    'helpers_filters.R'
    'helpers_alignment.R'
    'gotoh.R'
    'amplicanAlign.R'
    'amplicanReport.R'
    'helpers_directory.R'
    'helpers_plots.R'
    'helpers_rmd.R'
VignetteBuilder: knitr

bioc-issue-bot commented 8 years ago

Your package has been approved for building. Your package is now submitted to our queue.

IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.

bioc-issue-bot commented 8 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/amplican_buildreport_20160919144907.html

bioc-issue-bot commented 8 years ago

Received a valid push; starting a build. Commits are:

29ab577 registered for bioc-devel mailing list

bioc-issue-bot commented 8 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/amplican_buildreport_20160920061043.html

bioc-issue-bot commented 8 years ago

Received a valid push; starting a build. Commits are:

b25ac06 windows should handle system.file example in ampli...

bioc-issue-bot commented 8 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/amplican_buildreport_20160920073636.html

bioc-issue-bot commented 8 years ago

Received a valid push; starting a build. Commits are:

2812aee reverted from using paste0 to system.file entirely...

bioc-issue-bot commented 8 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/amplican_buildreport_20160920081305.html

JokingHero commented 8 years ago

Hi,

moscato1 check is complaining:

```
* checking loading without being on the library search path ... WARNING
Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :
  there is no package called 'httr'
Error: package or namespace load failed for 'amplican'
```

I am rather puzzled, as I do not declare any dependency on the 'httr' package, nor do I believe I need it for anything. Why does only Windows have this problem and not the other systems? Should I add this package to the dependencies?

Best, Kornel

titaiwangms commented 8 years ago

I got the same warning message as you, and I don't know how to solve it either.

mtmorgan commented 8 years ago

This seems to be a build system configuration issue that does not require a package author fix; @lshep might respond here with an update.

For this issue, httr is an indirect dependency via ggbio --> biovizBase --> ensembldb --> AnnotationHub

For https://github.com/Bioconductor/Contributions/issues/124, httr is an indirect dependency via methylumi --> minfi --> GEOquery

mtmorgan commented 8 years ago

Thanks for your contribution.

Please spell check the DESCRIPTION, man, vignette, ... pages!

DESCRIPTION

spell check

vignette

please spell-check your vignette, e.g., 'posible' 'coffe', ...
use a temporary location tempfile() or tempdir() in the vignette / examples, rather than writing to getwd().
avoid long lines of code, e.g., amplicanOverview.Rmd:97 by un-nesting function calls; don't use explicit path separators but rely on the function call to use the appropriate separator for the operating system in use.
```
fl <- system.file("extdata", "results", "barcode_reads_filters.csv",
    package = "amplican")
barcodeFilters <- read.csv(fl)
```

R

Since a major novelty in this package is the implementation of the Gotoh alignment algorithm, consider exposing this as a stand-alone function. It would operate on input objects (from ShortRead and Biostrings?) rather than files, and would return objects that can be computed on (Biostrings::PairwiseAlignment?)
use BiocParallel rather than doParallel / foreach for a more consistent interface across Bioconductor packages.
use file.path() rather than paste0() to construct file paths.
return meaningful values, perhaps using invisible(), from all functions. For instance, from amplicanPipeline() return results_folder
avoid misuse of ifelse() for scalar tests (e.g., amplicanAlign.R:128), use
```
result <- if (test) {
  ## TRUE value
} else {
  ## FALSE value
}
```
or similar (ifelse() is meant for use with vector arguments).
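
To illustrate the point about ifelse(), here is a minimal sketch with made-up values (`paths` and `use_first` are hypothetical names, not taken from amplican). ifelse() shapes its result after the test, so a scalar test silently truncates a longer value, whereas if / else returns the selected branch whole.

```r
# Hypothetical values for illustration only.
paths <- c("a.fastq", "b.fastq", "c.fastq")
use_first <- TRUE  # a scalar (length-one) test

# ifelse() returns a result as long as the test, so only the
# first element of `paths` survives.
truncated <- ifelse(use_first, paths, "none")

# if / else returns the selected value unchanged.
whole <- if (use_first) {
  paths
} else {
  "none"
}

length(truncated)  # 1
length(whole)      # 3
```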

avoid repeated calls to writeLines(), e.g., amplicanAlign.R:180

```
writeLines(
  c(paste("Config file:           ", "foo"),
    paste("Processors used:       ", 2),
    paste("Skip Bad Nucleotides:  ", TRUE),
    ...),
  logFileConn)
```
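
Several of the suggestions above (file.path() for paths, a single writeLines() call, a meaningful invisible return value) can be combined in one small function. The following is only a sketch under stated assumptions: `write_run_log()`, its arguments, and the file name `run_log.txt` are hypothetical and are not part of amplican.

```r
# Hypothetical helper: write a run log with a single writeLines() call
# and return the log path invisibly so callers can capture it.
write_run_log <- function(results_folder, config, processors) {
  # file.path() inserts the separator appropriate for the platform.
  log_path <- file.path(results_folder, "run_log.txt")
  writeLines(
    c(paste("Config file:    ", config),
      paste("Processors used:", processors)),
    log_path
  )
  invisible(log_path)
}

# Write below tempdir() rather than getwd().
results_folder <- tempfile("amplican_demo_")
dir.create(results_folder)
log_path <- write_run_log(results_folder, "config.csv", 2)
readLines(log_path)
```

Returning the path invisibly keeps the console quiet when the function is called for its side effect, while still letting a caller store the result.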

use seq_len() / seq_along() rather than 1:n / 1:length(x)
choose an indentation scheme that does not result in highly indented, very short and illegible lines, as at amplicanAlign.R:222
The output format from gRCPP contains familiar concepts (e.g., an alignment CIGAR) but in an idiosyncratic format. Present this information in a standard representation.
avoid direct slot access, e.g., helpers_alignment.R:181 ShortRead::sread(forwardsTable) rather than forwardsTable@sread, as(quality(reads), "matrix") instead of as(slot(reads, "quality"), "matrix") at helpers_filters.R:16.
is this line helpers_alignment.R:272 like Biostrings::reverseComplement()? Use Biostrings to reduce the number of package dependencies.
```
seqinr::c2s(rev(seqinr::comp(seqinr::s2c(guideRNA))))
```
'Hoist' common expressions outside for loops, to vectorize (speed up) calculations, e.g., the 'reverse complement' (?) commands at helpers_alignment.R:289 on individual rows of a data.frame can be applied to the entire column outside the loop. Indeed, much of this and other loops seem that they can be effectively vectorized.
minimize the number of times data is written to a file by returning vectors from calculations, and writing the vector. APPLY THESE CONSIDERATIONS TO ALL ITERATIONS
unpackFastq() is not necessary for input to ShortRead, if that is how it is being used.
use matrix-wise operations such as matrixStats::rowMins() rather than apply() at helpers_filters.R:16 and elsewhere
helpers_filters.R:62 use grepl("^[ATCG]+$", sread(reads)) or stringr::str_detect(sread(reads), "^[ATCG]+$") rather than sapply(). If the functionality of grepl() and str_detect() is approximately the same and their performance is roughly equivalent, then use grepl() to reduce the number of package dependencies. Can you provide an example of the error implied in the comment "possible c stack limits"?
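
Two of the points in this list can be shown with base R alone. This is a small sketch on made-up character data (the `reads` vector is hypothetical and stands in for the sequences returned by sread()): grepl() is already vectorized over its input, and seq_along() behaves correctly on empty input where 1:length(x) does not.

```r
# Made-up reads; the last one contains an ambiguous base.
reads <- c("ATCG", "GGCA", "ATNG")

# grepl() is vectorized over its second argument, so no sapply() is needed.
only_acgt <- grepl("^[ATCG]+$", reads)

# seq_along() is safe for empty input, where 1:length(x) yields c(1, 0).
empty <- character(0)
bad_index <- 1:length(empty)
good_index <- seq_along(empty)

only_acgt
length(bad_index)   # 2, so a for loop would run twice on nothing
length(good_index)  # 0
```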

src

use spaces or tabs for indentation, not both.

man

Please document the Gotoh algorithm, for instance, a literature citation
Please clearly mark functions that are not exported, so the user knows that they are not to be called directly.

Please address the points above, and when your package is again passing the build and check process correctly include a brief summary of your response to each of these points.

mtmorgan commented 8 years ago

Do you plan to submit a revised package in time for the current release? The deadline is today.

JokingHero commented 8 years ago

Yes, I will try to fix what I can. I am afraid that reworking the Gotoh implementation so that "It would operate on input objects (from ShortRead and Biostrings?) rather than files, and would return objects that can be computed on (Biostrings::PairwiseAlignment?)", and addressing "The output format from gRCPP contains familiar concepts (e.g., an alignment CIGAR) but in an idiosyncratic format. Present this information in a standard representation.", will not be possible today. My goal would be to fix this in the next release. We could still expose the gRCPP function to the user; do you think we should do that? I have refrained from doing so until now, as I am also unhappy with the inputs and outputs of the gRCPP function.

mtmorgan commented 8 years ago

I would rather see

Since a major novelty in this package is the implementation of the Gotoh alignment algorithm, consider exposing this as a stand-alone function. It would operate on input objects (from ShortRead and Biostrings?) rather than files, and would return objects that can be computed on (Biostrings::PairwiseAlignment?)

addressed prior to accepting the package; if this can be done in the next several days then we can be more relaxed about the deadline for adding new packages to the Bioconductor release.

bioc-issue-bot commented 8 years ago

Received a valid push; starting a build. Commits are:

4e31808 spell check all the files
2d307a0 more spell check
9a8c0dd using file.path(), identation to 2 spaces
9743239 avoid long lines in vignette, invisible(), removed...
915fcf5 save before removing unpackfastq
5b0dd41 removed unpacking and deleting zipped files
9e1f1a5 writing to files moved to highest level possible
21ff13b biocparallel and rerun on example dataset
c630445 up one version, removed R.utils from DESCRIPTION
1e35a3d Merge pull request #1 from valenlab/bioc_review B...

bioc-issue-bot commented 8 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/amplican_buildreport_20161007151956.html

JokingHero commented 8 years ago

First of all, thank you for all the feedback and comments.

In this update I tried my best to address the following comments:

[ x ] - Please spell check the DESCRIPTION, man, vignette, ... pages!

DESCRIPTION

[ x ] spell check

vignette

[ x ] please spell-check your vignette, e.g., 'posible' 'coffe', ...
[ x ] use a temporary location tempfile() or tempdir() in the vignette / examples, rather than writing to getwd().
[ x ] avoid long lines of code, e.g., amplicanOverview.Rmd:97 by un-nesting function calls; don't use explicit path separators but rely on the function call to use the appropriate separator for the operating system in use.

```
fl <- system.file("extdata", "results", "barcode_reads_filters.csv",
    package = "amplican")
barcodeFilters <- read.csv(fl)
```

R

[ Not yet ready! ] Since a major novelty in this package is the implementation of the Gotoh alignment algorithm, consider exposing this as a stand-alone function. It would operate on input objects (from ShortRead and Biostrings?) rather than files, and would return objects that can be computed on (Biostrings::PairwiseAlignment?)
[ x ] use BiocParallel rather than doParallel / foreach for a more consistent interface across Bioconductor packages.
[ x ] use file.path() rather than paste0() to construct file paths.
[ x ] return meaningful values, perhaps using invisible(), from all functions. For instance, from amplicanPipeline() return results_folder
[ x (it was used with vector argument, removed elsewhere)] avoid misuse of ifelse() for scalar tests (e.g., amplicanAlign.R:128), use

```
result <- if (test) {
  ## TRUE value
} else {
  ## FALSE value
}
```
or similar (ifelse() is meant for use with vector arguments).

[ x ] avoid repeated calls to writeLines(), e.g., amplicanAlign.R:180

```
writeLines(
  c(paste("Config file:           ", "foo"),
    paste("Processors used:       ", 2),
    paste("Skip Bad Nucleotides:  ", TRUE),
    ...),
  logFileConn)
```

[ x ] use seq_len() / seq_along() rather than 1:n / 1:length(x)
[ x (hopefully acceptable now)] choose an indentation scheme that does not result in highly indented, very short and illegible lines, as at amplicanAlign.R:222
[ Not yet ready! ] The output format from gRCPP contains familiar concepts (e.g., an alignment CIGAR) but in an idiosyncratic format. Present this information in a standard representation.
[ x ] avoid direct slot access, e.g., helpers_alignment.R:181 ShortRead::sread(forwardsTable) rather than forwardsTable@sread
[ Not done: I could not find a working import for the quality() accessor no matter what I tried; the package would not pass checks ] as(quality(reads), "matrix") instead of as(slot(reads, "quality"), "matrix") at helpers_filters.R:16.
[ x ] Biostrings::reverseComplement()? Use Biostrings to reduce the number of package dependencies.
[ x ] 'Hoist' common expressions outside for loops, to vectorize (speed up) calculations, e.g., the 'reverse complement' (?) commands at helpers_alignment.R:289 on individual rows of a data.frame can be applied to the entire column outside the loop. Indeed, much of this and other loops seem that they can be effectively vectorized.
[ x ] minimize the number of times data is written to a file by returning vectors from calculations, and writing the vector. APPLY THESE CONSIDERATIONS TO ALL ITERATIONS
[ x ] unpackFastq() is not necessary for input to ShortRead, if that is how it is being used.
[ x ] use matrix-wise operations such as matrixStats::rowMins() rather than apply() at helpers_filters.R:16 and elsewhere
[ x ] helpers_filters.R:62 use grepl("^[ATCG]+$", sread(reads)) or stringr::str_detect(sread(reads), "^[ATCG]+$") rather than sapply(). If the functionality of grepl() and str_detect() is approximately the same and their performance is roughly equivalent, then use grepl() to reduce the number of package dependencies.
[ Not yet ready! I need to run on a large chunk of data overnight to try to replicate this error.] Can you provide an example of the error implied in the comment "possible c stack limits"?

src

[ x ] use spaces or tabs for indentation, not both.

man

[ x ] Please document the Gotoh algorithm, for instance, a literature citation
[ x (Exported functions are marked with "@export" tag, only they will be visible for the user. Is there something specific I should do to mark not exported functions?)] Please clearly mark functions that are not exported, so the user knows that they are not to be called directly.

JokingHero commented 8 years ago

After some discussion with the maintainer, we decided to switch from our gotoh function to Biostrings::pairwiseAlignment in this package. Which aligner we use is not the main substance of our contribution; amplican is meant as a pipeline for high-throughput amplicon sequencing, specialized for CRISPR experiments.

Also, we would like to postpone the release until the next Bioconductor release cycle. We would like to test some more and gather more feedback from collaborators.

mtmorgan commented 8 years ago

OK, I will close this issue. Feel free to open a new issue when your updated package is ready.

JokingHero commented 7 years ago

I tried to open a new issue to submit amplican again, but bioc-issue-bot complains that I have already submitted this repository more than once and that it exists in the issue tracker. See #454. Can we reopen this submission? @mtmorgan @gr22772

mtmorgan commented 7 years ago

Please perform a version bump.

JokingHero commented 7 years ago

I bumped the version to 0.9.100 as a new starting point. Should that trigger the build automatically? Or should we check our web hooks? Or does the red error label "VERSION BUMP REQUIRED" prevent the build?

mtmorgan commented 7 years ago

It should trigger a new build; when the build is successful the 'VERSION BUMP REQUIRED' tag will be removed. Can you check the web hook?

mtmorgan commented 7 years ago

There might be problems with having closed the issue; if the web hook is ok let me know and we'll work on it from this end.

JokingHero commented 7 years ago

I am sorry for the long delay; apparently we had communication problems. I confirmed that we still have the web hook. Could you work out some solution for our submission? Maybe it would be easier to close this issue, remove it from GitHub completely, and resubmit the package (the package has changed so much that I believe the previous comments are no longer relevant)? Or I could change the name to ampliCan: I made the name lower case so that it is easier to type, but the actual name on our logo is ampliCan.

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/amplican_buildreport_20170828171931.html

mtmorgan commented 7 years ago

I suspect that your unit test fails because it uses the same directory on each architecture -- results_folder <- ... should be something like results_folder <- tempfile(); dir.create(results_folder).
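
The suggestion above can be sketched as a small test helper. Everything named here (`with_results_folder()`, `report.csv`) is hypothetical and not taken from the amplican test suite; the point is only that each test gets a fresh directory under tempdir() and cleans up after itself.

```r
# Hypothetical test helper: run `code` against a fresh results folder
# and remove the folder afterwards.
with_results_folder <- function(code) {
  results_folder <- tempfile("results_")
  dir.create(results_folder)
  on.exit(unlink(results_folder, recursive = TRUE), add = TRUE)
  code(results_folder)
}

# The first test writes a file into its own folder.
first <- with_results_folder(function(dir) {
  file.create(file.path(dir, "report.csv"))
  list(dir = dir, files = list.files(dir))
})

# The second test starts from an empty folder and sees nothing from the first.
second <- with_results_folder(function(dir) {
  list(dir = dir, files = list.files(dir))
})
```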

Also please confirm on the next version bump (an increment to z+1 for version x.y.z is sufficient) that the web hook runs, or at least what the return value is, under settings --> web hooks --> edit and then choose the hook(s) and look at 'Response'.
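
For reference, a z+1 bump can be expressed with base R's version classes. This is only an illustrative sketch: `bump_patch()` is a hypothetical helper, and in practice the Version field of DESCRIPTION is usually edited by hand.

```r
# Hypothetical helper: increment z in an x.y.z version string.
bump_patch <- function(version) {
  # A package_version is a list holding one integer vector, c(x, y, z).
  parts <- unclass(package_version(version))[[1L]]
  stopifnot(length(parts) == 3L)
  parts[3] <- parts[3] + 1L
  paste(parts, collapse = ".")
}

bumped <- bump_patch("0.9.100")
bumped  # "0.9.101"

# Versions compare component-wise as numbers, not as strings.
package_version(bumped) > package_version("0.9.100")
```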

JokingHero commented 7 years ago

Just made commit into 0.9.101 and web hook returns above.

mtmorgan commented 7 years ago

Great, thanks, I added the 'review in progress' label for the future, and triggered a manual rebuild.

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/amplican_buildreport_20170829082652.html

JokingHero commented 7 years ago

Great! Thank you!

During code review, if you have any suggestions on how to speed up the getEventInfo function (helpers_general.R), that would be great, as it is the main bottleneck (not the alignment process itself). The goal is to extract deletions, insertions and mismatches from the PairwiseAlignmentsSingleSubject class into a GRanges object with metadata columns. The main issue is that extracting deletions with the natural Biostrings::deletion returns ranges that are not from the subject's point of view. I get around this by shifting deletions for each insertion beforehand, if any, but it is slow. If there were a way to vectorize this process (maybe the C-level Biostrings library?), I would be grateful for advice on how to achieve that. For the moment, the current implementation works and is properly tested in test_alignment_helpers.R.

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

353d7c9 allow for mismatches in primers, make sure plots h...
3cc8e86 improve consensus alghoritm so that it accounts fo...
a94a0f1 change params of alignments, fix when no ins in va...
dbfd7b5 add ampliconConsensus picture and explanation, rev...

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/amplican_buildreport_20170918093042.html

mtmorgan commented 7 years ago

Sorry to be slow in returning comments. Below are some minor general things. If you can provide an easy way for me to run an example with getEventInfo() then I'll be happy to work on it in the short term.

Seems like the one-year anniversary of my initial review!

DESCRIPTION / NAMESPACE

good
CONSIDER: large number of imports makes the package fragile; are they all strictly necessary?

vignette

format long code lines so that they are easy to read (do not wrap) in the built vignette.

R

use accessors instead of direct slot access, e.g., in validity method replace object@experimentData with experimentData(object)
AlignmentsExperimentSet-class.R:76 ensure that enough information is displayed to help the user rather than (as it appears) an overwhelming amount of data.
cat(), message(), warning(), stop() do not usually require paste() inside them, cat(paste("foo", "bar")) --> cat("foo", "bar")
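
A quick check of the point about paste(), using a made-up value (`n_reads` is hypothetical). One detail worth keeping in mind: cat() separates its arguments with a space by default, while message(), warning() and stop() concatenate them without a separator, so the space has to be written explicitly there.

```r
n_reads <- 42L  # made-up value

# cat() joins its arguments with a space, so paste() adds nothing.
with_paste <- capture.output(cat(paste("Reads processed:", n_reads), "\n"))
without_paste <- capture.output(cat("Reads processed:", n_reads, "\n"))
identical(with_paste, without_paste)

# message() inserts no separator; note the trailing space in the string.
msg <- tryCatch(
  message("Reads processed: ", n_reads),
  message = function(m) conditionMessage(m)
)
trimws(msg)
```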

amplicanAlign.R:79 it is better practice to leave BPPARAM unspecified, so the user has control through BiocParallel::register(). The conditional could be simplified as

```
if (use_parallel) {
    p <- BiocParallel::bpparam()     # user choice
} else {
    p <- BiocParallel::SerialParam() # standard lapply
}
configSplit <- split(cfgT, f = cfgT$Barcode)
finalAES <- BiocParallel::bplapply(configSplit, FUN = makeAlignment,
                                   average_quality,
                                   min_quality,
                                   scoring_matrix,
                                   gap_opening,
                                   gap_extension,
                                   fastqfiles,
                                   primer_mismatch, BPPARAM = p)
```

amplicanAlign.R:104 do.call(c, x) is usually more efficient than Reduce(c, x)
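
The difference between the two forms can be seen on plain vectors. This sketch uses made-up data (`chunks` is hypothetical); for S4 containers the outcome depends on how c() is defined for the class, so the equivalence shown here for base vectors should be checked rather than assumed.

```r
# Made-up per-barcode results as plain integer vectors.
chunks <- list(a = 1:3, b = 4:5, c = 6L)

# Reduce() calls c() once per element: c(c(a, b), c).
via_reduce <- Reduce(c, chunks)

# do.call() makes a single call: c(a, b, c). The list names are dropped
# first, because c() would otherwise use them to name the result.
via_do_call <- do.call(c, unname(chunks))

identical(via_reduce, via_do_call)
```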

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

de1ad33 bioconductor review + change gap opening to 25

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/amplican_buildreport_20170921135804.html

JokingHero commented 7 years ago

Thank you for the feedback. Here is a gist with getEventInfo; if you can suggest something to make it faster, future users and I will be grateful.

DESCRIPTION / NAMESPACE

[ x ] good
[ x ] CONSIDER: large number of imports makes the package fragile; are they all strictly necessary? Unfortunately yes. For some of them I use only one function, but this is a very high-level package, and we will take responsibility for keeping up with fixes in those packages.

vignette

[ ? ] format long code lines so that they are easy to read (do not wrap) in the built vignette. - If I understand correctly, calls should not be nested, e.g., something(something2(something3(x))). I fixed that case.

R

[ x ] use accessors instead of direct slot access, e.g., in validity method replace object@experimentData with experimentData(object)
[ ? ] AlignmentsExperimentSet-class.R:76 ensure that enough information is displayed to help the user rather than (as it appears) an overwhelming amount of data. - Currently the show method prints information only for the first experiment. I changed readCounts to print with str(). In my experience it is easier to manipulate an object if I can look at its first element, even if the output is a bit longer.
[ x ] cat(), message(), warning(), stop() do not usually require paste() inside them, cat(paste("foo", "bar")) --> cat("foo", "bar")
[ x ] amplicanAlign.R:79 it is better practice to leave BPPARAM unspecified, so the user has control through BiocParallel::register(). The conditional could be simplified as

```
if (use_parallel) {
    p <- BiocParallel::bpparam()     # user choice
} else {
    p <- BiocParallel::SerialParam() # standard lapply
}
configSplit <- split(cfgT, f = cfgT$Barcode)
finalAES <- BiocParallel::bplapply(configSplit, FUN = makeAlignment,
                                   average_quality,
                                   min_quality,
                                   scoring_matrix,
                                   gap_opening,
                                   gap_extension,
                                   fastqfiles,
                                   primer_mismatch, BPPARAM = p)
```

[ x ] amplicanAlign.R:104 do.call(c, x) is usually more efficient than Reduce(c, x) - The current code works as intended: do.call keeps the result as a list, while Reduce merges all objects, using c, into a single AlignmentsExperimentSet object.

mtmorgan commented 7 years ago

I looked quite a bit at getEventInfo, although I'm not actually familiar with the aligned string representations in Biostrings. I did not come up with meaningful performance improvements. Some minor changes include:

Use the constructor rather than construct-and-assign in defGR(),

```
GenomicRanges::GRanges(
    ranges = x,
    strand = strand_info,
    seqnames = ID,
    originally = as.character(originally),
    replacement = as.character(replacement),
    type = type,
    read_id = names(x),
    score = score
)
```

minimize operations within conditionals, e.g.,

```
shift_to_subj <- function(x, ampl_shift, subject, strand_info) {
  if (strand_info == "+") {
    delta <- 1L
  } else {
    delta <- stringr::str_count(subject, "[ATCG]")
  }
  IRanges::shift(x, ampl_shift - delta)
}
```

and

```
width <- nchar(align)
if (strand_info == "+") {
  s_err <- width + ampl_shift - 1L >= ampl_len
  start <- width[!s_err] + ampl_shift
  end <- if (all(s_err)) integer() else ampl_len
} else {
  s_err <- width + abs(ampl_shift - ampl_len) >= ampl_len
  start <- if (all(s_err)) integer() else 1L
  end <- ampl_shift - width[!s_err]
}
sizes <- IRanges::IRanges(start = start, end = end, names = which(!s_err))
```

avoid nested iterations with vectorization, e.g.,

```
ins_r <- rep(ins, each = lengths(del))
del_s <- unlist(start(del))
sft <- -1 * sum(width(ins_r)[del_s > start(ins_r)])
del_sft <- relist(sft, del)
del <- IRanges::shift(del, del_sft)
```

this is actually a little slower than an intermediate solution that hoists the accessors out of the iteration:

```
shift_del <- mendoapply(function(x, y, w) {
    vapply(
        x,
        function(x_i, y, w) sum(w[x_i > y]),
        integer(1),
        y, w
    )
}, BiocGenerics::start(del), BiocGenerics::start(ins), BiocGenerics::width(ins))
del <- IRanges::shift(del, -1 * shift_del)
```
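
For a single read, the inner computation (for each deletion start, the total width of the insertions that start before it) can also be written without a per-element loop. The following is a sketch on plain integer vectors with made-up coordinates; `del_start`, `ins_start` and `ins_width` are hypothetical names, and this variant has not been run against amplican or benchmarked on real alignments.

```r
# Made-up coordinates: starts of deletions, and starts/widths of insertions.
del_start <- c(3L, 7L, 10L, 12L)
ins_start <- c(5L, 10L)
ins_width <- c(2L, 3L)

# Loop form: for each deletion, total width of insertions that start before it.
shift_loop <- vapply(
  del_start,
  function(x_i) sum(ins_width[x_i > ins_start]),
  integer(1)
)

# Vectorized form: sort the insertions once, build a running total of widths,
# then count how many insertion starts lie strictly before each deletion start.
ord <- order(ins_start)
running <- c(0L, cumsum(ins_width[ord]))
n_before <- findInterval(del_start - 1L, ins_start[ord])
shift_vec <- running[n_before + 1L]

identical(shift_loop, shift_vec)
```

Because the coordinates are integers, "insertion start < deletion start" is the same as "insertion start <= deletion start - 1", which is what findInterval() counts here.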

You can either incorporate these changes or not; let me know via a comment and I will accept the package.

JokingHero commented 7 years ago

Thank you for your effort and for caring! Do you think implementing some parts in C++ (maybe the mendoapply(function(x, y, w) ...) part) would give any benefit?

mtmorgan commented 7 years ago

no; about 30% of the time is in mismatchSummary(), which is doing complicated queries on events; it would be tedious and error-prone to do that in C. I think you could get a speed-up by iterating on the events part

```
  width <- nchar(align)
  subj <- as.character(subject(align))
  pat <- Biostrings::pattern(align)
  del <- Biostrings::deletion(align)
  ins <- Biostrings::insertion(align)
  mm <- Biostrings::mismatchSummary(align)$subject
```

and collapsing the result of the iteration into Vectors and a partitioning, allowing for vectorization, but that would be a little (not impossible) tedious.
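
The idea of collapsing per-read results into one flat vector plus a partitioning can be sketched in base R. This is only an illustration on made-up data (`events` is a hypothetical list of event widths per read); Bioconductor's list-like classes provide their own machinery for this, which is not shown here.

```r
# Made-up event widths per read; read2 has no events.
events <- list(read1 = c(3L, 5L), read2 = integer(0), read3 = 7L)

# Collapse into one flat vector plus a partitioning (group boundaries).
flat <- unlist(events, use.names = FALSE)
n <- lengths(events)
ends <- cumsum(n)
starts <- ends - n

# One vectorized pass over the flat data, then per-read totals taken as
# differences of the running sum at the group boundaries.
running <- c(0L, cumsum(flat))
totals <- running[ends + 1L] - running[starts + 1L]
names(totals) <- names(events)

totals
```

The per-read totals match what a loop such as vapply(events, sum, integer(1)) would give, but the arithmetic runs once over the whole flat vector.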

mtmorgan commented 7 years ago

I'll accept this package now; further performance improvements can be pursued once it's in Bioconductor.

bioc-issue-bot commented 7 years ago

Your package has been accepted. It will be added to the Bioconductor Git repository and nightly builds. Additional information will be sent to the maintainer email address in the next several days.

Thank you for contributing to Bioconductor!

JokingHero commented 7 years ago

Alright, Thank you!

mtmorgan commented 7 years ago

The master branch of your GitHub repository has been added to Bioconductor's git repository.

To use the git.bioconductor.org repository, we need an 'ssh' key to associate with your github user name. If your GitHub account already has ssh public keys (github.com/.keys is not empty), then no further steps are required. Otherwise, do the following:

See further instructions at

https://bioconductor.org/developers/how-to/git/

for working with this repository. See especially

https://bioconductor.org/developers/how-to/git/new-package-workflow/

https://bioconductor.org/developers/how-to/git/sync-existing-repositories/

to keep your GitHub and Bioconductor repositories in sync.

Your package will be included in the next nightly 'devel' build (check-out from git at about 6 pm Eastern; build completion around 2pm Eastern the next day) at

https://bioconductor.org/checkResults/

(Builds sometimes fail, so ensure that the date stamps on the main landing page are consistent with the addition of your package). Once the package builds successfully, your package will be available for download in the 'Devel' version of Bioconductor using biocLite("YOUR_PACKAGE_NAME"). The package 'landing page' will be created at

https://bioconductor.org/packages/YOUR_PACKAGE_NAME

If you have any questions, please contact the bioc-devel mailing list (https://stat.ethz.ch/mailman/listinfo/bioc-devel); this issue will not be monitored further.

Bioconductor / Contributions

amplican #116
