Closed jaehyunjoo closed 2 years ago
Hi @jaehyunjoo
Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.
The DESCRIPTION file for this package is:
Package: poplin
Title: LC/MS Metabolomics Data Processing Utilities
Version: 0.99.0
Authors@R: c(
person(given = "Jaehyun",
family = "Joo",
role = c("aut", "cre"),
email = "jaehyunjoo@outlook.com"),
person(given = "Blanca",
family = "Himes",
role = c("aut"),
email = "bhimes@pennmedicine.upenn.edu")
)
Description: Defines a S4 class for storing LC/MS metabolomics data processing
results and provides utility functions for performing
imputation, normalization, dimension reduction, and common
visualizations.
License: GPL-3
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.2
biocViews: Metabolomics, MassSpectrometry, DataRepresentation, Preprocessing, Normalization
Imports:
BiocGenerics,
rlang,
ggplot2,
heatmaply,
methods,
S4Vectors,
stats,
Suggests:
limma,
Biobase,
hexbin,
Rtsne,
missForest,
vsn,
VIM,
pcaMethods,
pls,
testthat (>= 3.0.0),
BiocStyle,
knitr,
rmarkdown
Depends:
R (>= 4.1),
SummarizedExperiment
Config/testthat/edition: 3
VignetteBuilder: knitr
A reviewer has been assigned to your package. Learn what to expect during the review process.
IMPORTANT: Please read this documentation for setting up remotes to push to git.bioconductor.org. It is required to push a version bump to git.bioconductor.org to trigger a new build.
Bioconductor utilized your github ssh-keys for git.bioconductor.org access. To manage keys and future access you may want to active your Bioconductor Git Credentials Account
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the build report for more details. This link will be active for 21 days.
Remember: if you submitted your package after July 7th, 2020,
when making changes to your repository push to
git@git.bioconductor.org:packages/poplin
to trigger a new build.
A quick tutorial for setting up remotes and pushing to upstream can be found here.
@jaehyunjoo I apologize for the delay. I will have a review for you within the next week.
@lshep No problem! Please take your time. I sincerely appreciate your time and help.
Please see initial review below:
Build Report
DESCRIPTION
BugReports
and URL
that points to
Github repositoryNAMESPACE
[ ] Did you choose to selectively import class functions for a reason like naming conflicts? Since you main class contains a SummarizedExperiment and be coerced using a SummarizedExperiment, for example, it may be worth doing the full import so that full functionality of the class can be utilized without any additional effort on your part?
[ ] .
functions normally indicate hidden/non-exported functions. Consider
removing .
from .verify_and_extract_input
README
[ ] Please also include Bioconductor installation instructions.
[ ] Are both md and Rmd format needed?
vignette
poplin.Rmd
so there is no chance of conflict. For instance when I just
do library and ask for the vignette I do not get your intro vignette, I get an
intro vignette for AzureGraph.> library(poplin)
> vignette("intro")
starting httpd help server ... done
Warning message:
vignette 'intro' found more than once,
using the one found in '/home/shepherd/R-Libraries/4.2-Bioc3.15/AzureGraph/doc'
[ ] Hiding ERROR and warnings in the vignette is not wise. It will prevent you from seeing when there is an error and potential problem with the vignette code. Please remove.
[ ] We recommend having an Installation
section in the
vignette for those that find the vignette on the landing page for the
package.
[ ] Instead of providing a secondary .gitignore
in the vignettes directory,
include these in the top level .gitignore
file.
data / man pages
R code/man pages
[ ] subset-methods.R.bak should be removed.
[ ] There is some concern whether an entirely new class structure is needed over utilizing the existing SummarizedExperiment/SingleCellExperiment class structure or class/method/functionality of existing packages like MSnbase, Qfeatures, RforMassSpectrometry.
I would like to defer comments on the code/class structure to one of our mass spectrometry/metabolomic developers. Please consider his comments are part of the formal review for comments.
@lgatto Could you please leave additional review comments.
Dear @jaehyunjoo
The first two letters have a direct match with a German word for butt, but the pronunciation wouldn't lead to any confusion. However, I can't figure out how the name relates to metabolomics or any feature related to data analysis, which might make the package difficult to find for new users.
The title and the description are a bit misleading, as the package focuses on quantitative metabolomics data processing and not raw LC/MS data at all. The functionality isn't specific to metabolomics; it is actually applicable to any quantitative omics data. As a result, I am not sure if the MassSpectrometry biocView is really relevant here.
There are no Introduction nor any Installation sections. In general, there's very little text that provides more context - the code looks like a repetition of the man pages.
There is a lot of redundancy with existing Bioconductor
packages. For example imputation and normalisation are are available
as basic functions that operate directly on matrices in the
MsCoreUtils
package and are re-used on higher-level
classes/packages, such as SummarizedExperiment
, QFeatures
,
MsFeatures
and MSnSet
instances. If any specific imputation or
normalisation wasn't available in MsCoreUtils
(random forest
imputation, for instance), it could of course be added. I think
there's a missed opportunity for integration between packages here.
There are some issues that cast doubts on the implementation and
actual need for the poplin
class. Indeed storing different
analysis steps as part of the same object (imputation,
normalisation, dimensionality reduction in this case) already exists
in Bioconductor: in a SummarizedExperiment
, these are all stored
as individual assays; in QFeatures
, these are stored as linked
SummarizedExperiment
s. Using a DataFrame
's columns doesn't seem
appropriate here, as shown below.
The poplinData
(and poplinReduced
) slot(s) is/are implemented as
DataFrame
(s) - why not use a list of matrix to be consistent with
SummarizedExperiment
s assay slots? The use of a DataFrame
is
easy to break using standard/valid accessors/code - below, for
example the colData
and the knn-imputed data aren't compatible
anymore.
> data(faahko_poplin)
> poplinData(faahko_poplin)
DataFrame with 206 rows and 2 columns
knn knn_cyclic
<matrix> <matrix>
1 1924712:1757151:1714582:... 20.4466:19.9522:20.0726:...
2 213659: 289501: 194604:... 17.3443:17.6197:17.2385:...
3 349011: 451864: 337473:... 18.0029:18.1785:17.9812:...
4 286221: 854341: 364300:... 17.6772:19.0195:18.0407:...
5 1160580:1018512:1345515:... 19.7946:19.1149:19.6982:...
... ... ...
202 116190: 441171: 88469.4:... 16.3391:17.9638:15.9276:...
203 210828: 880169:139328.8:... 17.2908:19.1672:16.7219:...
204 379543:1254097:206622.8:... 18.0925:19.5905:17.2340:...
205 575200: 166249:172773.6:... 18.6621:16.5454:16.8744:...
206 170253:3148507:165221.7:... 16.9727:20.9890:16.9570:...
> poplinData(faahko_poplin)[[1]] <- poplinData(faahko_poplin)[[1]][, 1:5]
> validObject(faahko_poplin)
[1] TRUE
> colData(faahko_poplin)
DataFrame with 12 rows and 2 columns
sample_name sample_group
<character> <character>
ko15.CDF ko15 KO
ko16.CDF ko16 KO
ko18.CDF ko18 KO
ko19.CDF ko19 KO
ko21.CDF ko21 KO
... ... ...
wt16.CDF wt16 WT
wt18.CDF wt18 WT
wt19.CDF wt19 WT
wt21.CDF wt21 WT
wt22.CDF wt22 WT
> dim(poplinData(faahko_poplin)[[1]])
[1] 206 5
This inconsistency highlights the possibly need to define a proper
validity methods. But simply, and preferably, storing these in the
original SummarizedExperiment
's assay slot would guard again this.
Note also that poplinReduced
is equivalent to the reducedDims
list that is available in the SingleCellExperiment
.
Why create aliases for widely used functions/accessors (such as
poplin_raw_list()
rather than assays()
and poplin_raw()
rather
than assay()
)? This comes with the substantial overhead of
redefining accessors and setters for assays, assay names, ...
The name of the raw slot isn't really appropriate here, as the data aren't raw data (in the LC/MS sense - see my comment about the title and description in the DESCRIPTION file), but quantitative data prior to any processing.
There's an overuse of S4 methods/generics. For example
poplinData()
, which is a rather specific name is defined as a
generic with a single method for the poplin
class. poplinData()
could simply be a function. Generally, all the generics in
AllGenerics.R
start with poplin
in their name, which begs the
question whether they should be generic - methods are implemented
for poplin
objects only (as in the example above) or also
additionally for matrices.
Along the same lines, the plotting function are implemented as S3 methods with default and poplin implementations. This isn't necessary, especially given the very specific name - functions will do it.
poplin_biplot <- function(x, ...) {
UseMethod("poplin_biplot")
}
poplin_boxplot <- function(x, ...) {
UseMethod("poplin_boxplot")
}
poplin_naplot <- function(x, ...) {
UseMethod("poplin_naplot")
}
## and so on
I spotted file R/subset-methods.R.bak
should be removed.
@lshep and @lgatto, Thank you so much for your insight and help. I will try to address the comments and improve the package.
@lshep
as @lgatto suggested, we definitely need to rename the package to something meaningful. I am wondering how to connect the new repository URL to this Github issue. Thanks a ton!
I think the best solution would be to open a new issue to make sure it gets added properly. When you submit it please let me know here and I can push it through the pre-review process and reassign myself to it.
Got it! Thank you so much for all your help.
I am going to close this issue. If you decide to move forward with a Bioconductor submission Please tag/reference this original issue so we know it is a continuation of this review. Cheers,
Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor
Confirm the following by editing each check box to '[x]'
[x] I understand that by submitting my package to Bioconductor, the package source and all review commentary are visible to the general public.
[x] I have read the Bioconductor Package Submission instructions. My package is consistent with the Bioconductor Package Guidelines.
[x] I understand Bioconductor Package Naming Policy and acknowledge Bioconductor may retain use of package name.
[x] I understand that a minimum requirement for package acceptance is to pass R CMD check and R CMD BiocCheck with no ERROR or WARNINGS. Passing these checks does not result in automatic acceptance. The package will then undergo a formal review and recommendations for acceptance regarding other Bioconductor standards will be addressed.
[x] My package addresses statistical or bioinformatic issues related to the analysis and comprehension of high throughput genomic data.
[x] I am committed to the long-term maintenance of my package. This includes monitoring the support site for issues that users may have, subscribing to the bioc-devel mailing list to stay aware of developments in the Bioconductor community, responding promptly to requests for updates from the Core team in response to changes in R or underlying software.
[x] I am familiar with the Bioconductor code of conduct and agree to abide by it.
I am familiar with the essential aspects of Bioconductor software management, including:
For questions/help about the submission process, including questions about the output of the automatic reports generated by the SPB (Single Package Builder), please use the #package-submission channel of our Community Slack. Follow the link on the home page of the Bioconductor website to sign up.