New package: flowLearn - Githubissues

mlux86 commented 6 years ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Repository: https://github.com/mlux86/flowLearn

Confirm the following by editing each check box to '[x]'

[x] I understand that by submitting my package to Bioconductor, the package source and all review commentary are visible to the general public.
[x] I have read the Bioconductor Package Submission instructions. My package is consistent with the Bioconductor Package Guidelines.
[x] I understand that a minimum requirement for package acceptance is to pass R CMD check and R CMD BiocCheck with no ERROR or WARNINGS. Passing these checks does not result in automatic acceptance. The package will then undergo a formal review and recommendations for acceptance regarding other Bioconductor standards will be addressed.
[x] My package addresses statistical or bioinformatic issues related to the analysis and comprehension of high throughput genomic data.
[x] I am committed to the long-term maintenance of my package. This includes monitoring the support site for issues that users may have, subscribing to the bioc-devel mailing list to stay aware of developments in the Bioconductor community, responding promptly to requests for updates from the Core team in response to changes in R or underlying software.

I am familiar with the essential aspects of Bioconductor software management, including:

[x] The 'devel' branch for new packages and features.
[x] The stable 'release' branch, made available every six months, for bug fixes.
[x] Bioconductor version control using Git (optionally via GitHub).

For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.

bioc-issue-bot commented 6 years ago

Hi @mlux86

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: flowLearn
Title: flowLearn
Version: 0.99.0
Authors@R: person("Markus", "Lux", email = "mlux@techfak.uni-bielefeld.de", role = c("aut", "cre"))
Author: Markus Lux [aut, cre]
Maintainer: Markus Lux <mlux@techfak.uni-bielefeld.de>
Description: flowLearn is an R package for predicting flow cytometry gating
    thresholds based on only density information and (very) few manual expert gates
    provided.
Depends:
    R (>= 3.4)
License: file LICENSE
Encoding: UTF-8
LazyData: true
Imports:
    proxy,dtw,parallel,stringr,cluster
RoxygenNote: 6.0.1
Suggests: knitr,
    rmarkdown
VignetteBuilder: knitr
biocViews: FlowCytometry

bioc-issue-bot commented 6 years ago

Your package has been approved for building. Your package is now submitted to our queue.

IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.

mtmorgan commented 6 years ago

As a preliminary requirement, please integrate your package with existing flow packages by sharing data representatations. Never use direct slot access @ but instead provide 'setter' and 'getter' functions.

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/flowLearn_buildreport_20171115100439.html

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

5f59609 version bump

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/flowLearn_buildreport_20171115131104.html

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

73afc18 added more importFrom statements to conform to Bio...

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/flowLearn_buildreport_20171115131616.html

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

92f4ac8 added more importFrom statements to conform to Bio...

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/flowLearn_buildreport_20171115132038.html

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

6b26d61 added getter functions for slots of the DensityDat...

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/flowLearn_buildreport_20171115140052.html

mlux86 commented 6 years ago

Hello @mtmorgan. There seem to be improvements, at least the errors are fixed.

There are now two warnings left:

* Checking R Version dependency...
    * WARNING: Update R version dependency from 3.4 to 3.5.

Why is it necessary to update R?

* Checking individual file sizes...
    * WARNING: The following files are over 5MB in size:
      'data/flSampleBcellEvaluationData.rda'

Is that a serious problem? If so, what are my options to keep the file size down? Unfortunately, Flow Cytometry data is usually very large. I did my best to keep the file size at a minimum already. This data is used mainly for the vignette but also for method documentation. I consider it a crucial part of the documentation (and the other two data sets as well).

Furthermore, I provided getter methods for the DensityData class. Setter methods are not desired.

And last, can you please elaborate on " integrate your package with existing flow packages by sharing data representatations"? What exactly do you mean by that?

Thanks! Markus

mtmorgan commented 6 years ago

Your package is added to the 'devel' branch of Bioconductor, and the 'devel' branch uses R-devel (both the devel branch and R-devel will be what the user sees when your package is included in the next release) so R-3.5 is what we test on, and hence what is appropriate for advertising as the correct version dependency for R.

See the guidelines for overall Bioconductor package size and processing constraints. The cost is in hosting, building, and distributing 1500 packages. It is often a challenge to identify data sets that are realistic enough, especially for the vignettes, but also moderately sized. Our git server imposes the 5MB limit on individual sizes; this is by design (ours) but reflects the overall philosophy.

Thanks for the getters.

There's a rich collection of flow packages, starting with flowCore, and it is desirable that whateever functionality you introduce plays well with these packages. Usually this means sharing data structures rather than reinventing your own. I left this somewhat vague because you as domain expert are in a better position than I to identify appropriate classes.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

f5acff5 remove sample data from package (is now in extra/d...

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/flowLearn_buildreport_20171116082841.html

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

ceafb7c require minimum R version 3.5

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/flowLearn_buildreport_20171116083352.html

mlux86 commented 6 years ago

Thank you for your comprehensive answer, @mtmorgan.

I now managed to get a build without errors or warnings.

As for re-using data representations from other packages: in flowLearn, there is no need for that (yet). The reason for that is that our tool needs to operate on a large set of densities and for this task we introduced the DensityData class. In the vignette, it is shown how to convert a flowFrame (as given by flowCore) into flowLearn's data structure. This is crucial for efficient computation. Other data structures are not required, yet.

Please let me know how to proceed.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

9870f25 version bump for bioconductor

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/flowLearn_buildreport_20171121054054.html

mlux86 commented 6 years ago

Dear @mtmorgan,

haven't heard back from you for three weeks now. I assume you're busy. Would you please give an approximate time frame for the second review, if possible? I'm happy to contribute.

Thanks you, Markus

mtmorgan commented 6 years ago

Sorry for the delay. See the comments below. Feel free to ask for guidance (please include @mtmorgan in the comment) and, when revisions are complete, include a short summary of your response to each point.

DESCRIPTION

An 'academic only' license is generally not appropriate for Bioconductor packages.
if Authors@R is present, Authors: and Maintainer: are redundant (they are created by R CMD build)

R

Use SummarizedExperiment rather than 'DensityData' to represent your data. You'll then be using an established class familiar to your users. The 'DensityData' object seems like it consists of an 'x' and 'y' matrix, as well as annotations on rows; the representation as a data.frame with arbitrary column names does not seem correct or robust. Instead this data fits well with a SummarizedExperiment

url <- 'extra/data/flSampleFlowFrame.rds' flSampleFlowFrame <- readRDS(url)

library(SummarizedExperiment) cidx <- c("fcs", "population", "channelIdx", "gate.low", "gate.high") flData <- flData(flSampleDensdat) colData <- flData[, cidx] assays <- list( X = t(unname(as.matrix(flData)[, startsWith(colnames(flData), "x")])), Y = t(unname(as.matrix(flData)[, startsWith(colnames(flData), "y")])) ) se <- SummarizedExperiment(assays=assays, colData = colData)

The data are accessible as assay(se, "X"); assay(se, "Y") and the annotations are colData(se); se$fcs; the SummarizedExperiment experiment has matrix-like semantics, so dim(), [, etc work.

if you wish to represent your data as a special type of 'SummarizedExperiment' simply extend the base class

setClass("DensityData", contains = "SummarizedExperiment")
It is almost never necessary to use eval(parse(text= in code; flFind is

flFind <- function(x, fcs, population, channel) { cidx <- (x$ == fcs) & (x$population == population) & (x$channelIdx == i) x[, cidx] }

similarly flGetGate is assay(x)[, c("gate.low", "gate.high")], etc.
Instead of pointing the user to the vignette for instructions on how to interoperate with established flow cytometry classes, incorporate this into your R code so that the user an easily invoke a function that creates the appropriate object (DensityData / SummarizedExperiment) for use in other parts of your package.
carefully consider the need for stringr, rather than base gsub() to reduce dependencies on third-party packages.
carefully check limits, e.g., flowLearn.R:321 sapply(2:(lenA-1) for edge cases that your user may encounter, e.g., when lenA < 3.

use vectorization rather than iteration (e.g., for, sapply(), ...) where ossible, e.g., from flowLearn.R:321

difA <- sapply(2:(lenA-1), function(i) {
    (densA$y[i] - densA$y[i-1]) + ((densA$y[i+1] - densA$y[i-1]) / 2) / 2
})

would seem to be, for y = densA$y

i = 2:(lenA - 1)
difA = (y[i] - y[i -1]) + ((y[i + 1] - y[i - 1]) / 2) / 2

avoid unnecessary conditionals that complicate code reasoning, e.g., flowLearn.R:424 is always thresh <- mean(dens$x[idx]) regardless of length of idx.
use BiocParallel::bplapply() rather than parallel::makeCluster() to allow more portabilty across platforms; see the BiocParallel vignette for developer implementation guidelines.
use consistent indentation & spacing; consider running your code through the formatR package.

vignettes

line 158: 'hoist' this constant expression out of the for loop.
Use the data from a package rather than the web. If including the data in this package makes the package too large for a software pacakge, and the data is used only in the vignettes / examples, create an experiment data package or ExperimentHub resource containing (and documenting) the data. Load the data from there.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

885aa26 added Session.vim to gitignore b0cd7a1 removed Author and Maintainer fields 8fa6626 changed signature of flFind such that it does not ... 1924ca7 added new method flAddFlowFrame to construct a Den... f111073 removed stringr dependency e274ee9 fixed edge cases of user inputs; replaced calculat... 0f642a1 removed unnecessary conditions; replaced parallel ... 0696795 fixed mixed indentation and spacing ed6d5d1 use example data only from within package and not ... b3d737a version bump with changes from the 2nd review of B...

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS, skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/flowLearn_buildreport_20171211111802.html

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

cfb9341 use default params of BiocParallel, fix warnings; ...

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/flowLearn_buildreport_20171211113020.html

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

8fc422a fixed documentation; version bump

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR, WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/flowLearn_buildreport_20171211113712.html

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

450ecba fixed documentation; version bump

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/flowLearn_buildreport_20171211114238.html

mlux86 commented 6 years ago

Dear @mmorgan,

Before I go into answering your comments: The Bioconductor build system indicates an error in the latest log. I cannot see the cause of the error and suspect something is going wrong due to the message 'BiocCheck' is not recognized as an internal or external command, operable program or batch file.. Can you please check?

Now for your comments: Thank you very much for taking the time to conduct this valuable review. I included most of your suggestions. Please find my comments inline with yours:

DESCRIPTION

An 'academic only' license is generally not appropriate for Bioconductor packages.

To our understanding, it is not an absolute requirement and we would like to continue with a free for academic use license.

if Authors@R is present, Authors: and Maintainer: are redundant (they are created by R CMD build)

The respective fields have been removed.

R

Use SummarizedExperiment rather than 'DensityData' to represent your data. You'll then be using an established class familiar to your users. The 'DensityData' object seems like it consists of an 'x' and 'y' matrix, as well as annotations on rows; the representation as a data.frame with arbitrary column names does not seem correct or robust. Instead this data fits well with a SummarizedExperiment

I agree that re-using existing data structures is important as it certainly increases understandability in the Bioconductor user base. However, I had a close look at the 'SummarizedExperiment' package, and due to the following reasons I think that using it is not a good idea:

It is primarily aimed at users of sequencing and microarray experiments. In contrast, flowLearn uses flow cytometry data. This may confuse users rather than increase familiarity. As an example, in sequencing or microarrays, the notion of an 'assay' is a fundamentally different one, and considering density x/y values as assays (as shown in your example) is confusing in my opinion.
Due to its main focus, depending on SummarizedExperiment would entail importing all of its dependencies as well. Especially 'GenomicRanges', 'GenomeInfoDb', 'IRanges', 'DelayedArray' do not provide any functionality that are of use for flowLearn. As you wrote in your comment about the 'stringr' package, one should reduce dependencies on third-party packages.
As I understand it, a main motivation for using the 'SummarizedExperiment' package is the re-use of its matrix like data structure. I argue that using a data frame (being itself a matrix structure), as done currently, is well suited for this task. To me it seems that using 'SummarizedExperiment' would just be a different way of representation and, quite frankly, I don't see the benefits of using it. You may well highlight those again for me.
The DensityData class provides methods of encapsulation (through the fl* functions), i.e. hides its functionality. Hence, users won't get in touch with the internal data frame, anyway.

Let me know what you think.

It is almost never necessary to use eval(parse(text= in code; flFind is [...code...] similarly flGetGate is assay(x)[, c("gate.low", "gate.high")], etc.

That's a good point. I changed the signature of flFind such that it does not have to use eval(parse(...))

Instead of pointing the user to the vignette for instructions on how to interoperate with established flow cytometry classes, incorporate this into your R code so that the user an easily invoke a function that creates the appropriate object (DensityData / SummarizedExperiment) for use in other parts of your package.

I agree and added a new function flAddFlowFrame() which takes care of this. The vignette has been updated to reflect this.

carefully consider the need for stringr, rather than base gsub() to reduce dependencies on third-party packages.

Indeed, this dependency is unnecessary. I removed it.

carefully check limits, e.g., flowLearn.R:321 sapply(2:(lenA-1) for edge cases that your user may encounter, e.g., when lenA < 3.

Changed. Another possible division by zero was also fixed in flEvalF1ScoreFCS.

use vectorization rather than iteration where possible

Changed.

avoid unnecessary conditionals that complicate code reasoning

Changed.

use BiocParallel::bplapply() rather than parallel::makeCluster() to allow more portabilty across platforms; see the BiocParallel vignette for developer implementation guidelines.

Good idea - replaced parallel with BiocParallel.

use consistent indentation & spacing; consider running your code through the formatR package.

Fixed.

vignettes

line 158: 'hoist' this constant expression out of the for loop.

Indeed, that was silly. Changed it.

Use the data from a package rather than the web. If including the data in this package makes the package too large for a software pacakge, and the data is used only in the vignettes / examples, create an experiment data package or ExperimentHub resource containing (and documenting) the data. Load the data from there.

Moved the data to the package itself. I now use only a subset of the data to keep the size at a minimum.

mtmorgan commented 6 years ago

Sorry for the delay and thanks for your updates.

Although there are 'legacy' Bioconductor packages with academic only licenses, we will not accept new packages with that licensing.

SummarizedExperiment is intentionally neutral about the biological nature of its content -- rowData(), colData(), assay(). While the weight of SummarizedExperiment might seem excessive for use of one package, it is used so heavily across packages that this cost is quickly amortized. The data.frame in your class relies on ad-hoc column naming and division of data.frame into row / column annotations versus 'data'; the class itself is forced to re-implement existing infrastructure, a basic example of this is the 'show' method, where your structure overwhelms the user with unnecessary information and details of data implementation, compared to the compact display and API hints provided by SummarizedExperiment. Another basic example is the intrinsic sanity checks provided by a matrix (homogenous data types) compared to a data.frame.

The main argument for SummarizedExperiment is the interoperability it provides. The main argument against DensityData is the additional engineering that would be required for it to be a robust data representation.

At the end of the day the goal of the review process is to incrementally improve the quality of packages, rather than to force them to some arbitrary standard. Take my suggestions in that vein, from someone with considerable experience in the software domain.

mtmorgan commented 6 years ago

Can you tell me how you wish to proceed? Thanks!

mlux86 commented 6 years ago

Dear Martin, I'm currently busy and do not have the time to pursue the issue. I'm closing it for now but will reopen it or create a new one as soon as time allows for it. Thank you so far for your suggestions and patience!

Bioconductor / Contributions

New package: flowLearn #553