Bioconductor review - Githubissues

Syksy commented 1 year ago

Dear all, we've received Bioconductor review requests at: https://github.com/Bioconductor/Contributions/issues/3047#issuecomment-1591516244

Breaking it down (and to be updated in respect to what's been addressed):

curatedPCaData https://github.com/Bioconductor/Contributions/issues/3047

[x] Please separate the analysis functionality from the infrastructure. The package should only deliver the data to the user and demonstrate how to obtain the data from ExperimentHub.

OK; As discussed in the submission thread for BioConductor at https://github.com/Bioconductor/Contributions/issues/3047#issuecomment-1625486043 , these downstream vignettes are to be moved to a new workflow package called curatedPCaWorkflow at: https://github.com/Syksy/curatedPCaWorkflow . Work will continue there-in regarding the proposed improvements. Please see answers below for specific actions taken toward this package-wide adjustment.

[x] The package has extensive efforts to work with the metadata but these can be avoided by strategically using metadata from the ExperimentHub resources.

Multiple steps have been taken to address this and are further elaborated in answers below; briefly:

Functionality that previously existed inside the main overview vignette has been moved to /R/getpcasummaries.R, and these functions are generalizable to any requested variable, are not dataset-specific, and contain e.g. standard documentation and unit testing.
Functionality that was deemed to be more toward downstream analyses has been moved away from the package, in particular functions contained previously at /R/wrappers.R.
The new functions utilize strategically both the internally provided metadata.csv (for example function getPCaStudies) as well as the clinical metadata accessible via colData (for example functions getPCaSummaryTable, getPCaSummarySurv, getPCaSummarySamples , and getPCaSummaryStudies).

DESCRIPTION

[x] Looks good.

OK

[x] The BiocGenerics Suggests: seems out of place.

OK; Addressed in https://github.com/Syksy/curatedPCaData/commit/7ba59f5fb1bd917a0a0fc48e5245aad34cfec51b ; in addition, the package dependencies have been made more lean as the functions at /R/wrappers.R are no longer present, removing the import need for e.g. dplyr and stringr.

[x] The CC license is not good for software consider using a different license.

OK; CC-BY 4.0 should still be fine based on discussion at https://github.com/Bioconductor/Contributions/issues/3047#issuecomment-1625486043

NAMESPACE

[x] _Avoid importing testthat and the expect_s4class function. To test for S4 class, use the name of the class, e.g., is(X, "KnownS4Class")

OK; This has now been adjusted to using just native R tests in: https://github.com/Syksy/curatedPCaData/commit/4c1a481a82060d371da958c23980853ac0080ed5 and the tests themselves were subsequently extended to cover the new functions at https://github.com/Syksy/curatedPCaData/commit/32352235fd9f0df8c04fd5d677ae2fb53c2b984c and https://github.com/Syksy/curatedPCaData/commit/ed9d0b8b2b0fd1ab2c8fd7647224fa05a062caa8 .

vignettes

[x] Consider restricting the text to 80 column width. It makes it easier to review and maintain.

OK; this formatting issue has been addressed in https://github.com/Syksy/curatedPCaData/commit/bb76b0121fdd8fcd7e949c583160eb36d1c04db4 . Furthermore, the rest of the code has also been double-checked for formatting (multiples of 4 spacebar indents, no tabs, char width 80), of which vast majority are fixed in e.g. https://github.com/Syksy/curatedPCaData/commit/019a304e9dffaf0b92046a308486dd80b6b5eec8 . The 12 lines (~0% of all code) longer than 80 chars and 11 lines (~0% of all code) that violate 4*spacebar multiples are special cases such as long URLs or automatically generated package Rd TeX-like code, respectively.

[x] The demos in the vignettes should show the user how to obtain the data, it does not necessarily need to show the user how to analyze it. The latter would be more appropriate in a separate workflow package.

OK; Notable effort has been put into revising the overview vignette so it exemplifies this effectively with generalizable code, while rest of the vignettes that had an analysis focus have been now shifted to the separate curatedPCaWorkflow-package. See comments below.

[x] Remove the analysis vignette as it is out of the scope of the package.

OK; Done in https://github.com/Syksy/curatedPCaData/commit/ee1c94bd8e09d9e6d072e5c5f27dd1655bcbc872 by moving to use curatedPCaWorkflow instead.

[x] _Consider using MatchedAssayExperiment instead of intersecting colnames and "samplenames" in the colData (Decipher vs BCR section).

OK; This pointer has been addressed in the PR by @ACSoupir at https://github.com/Syksy/curatedPCaData/pull/45 . However, as the analyses vignette has now been incorporated into the workflow package instead, this fix was not merged into curatedPCaData. This fix will instead be incorporated into the curatedPCaWorkflow.

[x] _Consider using $ to access colData columns, e.g., mae_ren$sampletype

OK; After the other fixes both into the overview vignette as well as the generalized functions provided now in /R/getpcasummaries.R, all calls of type mae_obj[,"variable"] are now instead formatted as mae_obj$variable.

[x] Consider using TCGAutils::TCGAprimaryTumors for subsetting for primary tumor samples with TCGA data.

OK, at least partially; this revision request was not implemented as suggested here per se. I explored both the TCGAutils (and the curatedTCGAdata) for this purpose. However, I ended up with the conclusion that this would've been best implemented as part of the data curation step in curatedPCaData, rather than implemented here when we're just offering the "window" to the data. In all of our datasets, the sample_type field in colData was extracted from the original source and represents the sample types (primary, normal, metastatic, ...). In this respect, I feel it would be a bit much to bring in a dependency just for TCGA, when this information is already available in the colData. However, to address the original purpose of subsetting to certain sample types, I have now modified the main getPCa function to allow subsetting the creation of the MAE object via MultiAssayExperiment::subsetByColData which will utilize the sample_type available in our data : https://github.com/Syksy/curatedPCaData/commit/ba50efa4e16d266593589403709ae6a4ffa9f9cf . I hope this adjustment fulfills the requested improvement to the package - if not, please let me know and I will revise accordingly to the best of my ability.

[x] The Oncoprint example looks complicated for a new user. If you are working with TCGA data consider using TCGAutils::oncoPrintTCGA, otherwise consider adding a helper function.

OK; The oncoprint related functions and wrappers have been now moved to the new curatedPCaWorkflow package (e.g. commit https://github.com/Syksy/curatedPCaData/commit/62be8a42449421dac6888d1d3c787242be8ea192 ). In hindsight, they are somewhat downstream analyses already, and therefore are better suited for this workflow-package. Thus, the oncoprint and their related functions are no longer part of the curatedPCaData-package.

[x] _Reduce the amount of code in the overview.Rmd vignette. Pre-processing with eval(parse(...)) is usually not the way to go. It seems that more needs to be done at the dataset and package level to avoid the juggling of metadata in the vignette. Add helper functions in the package to extract and create a single row for each of your summaries of interests, for example, gleasons(maeobject) would return one row as given by the pre-processing step in the vignette.

OK; This is a fair point, and indeed should've been the design already in the start. Notable effort has now been done to move the functionality previously coded inside the overview vignette, and made exported by the package itself accompanied by suitable examples, testing, and documentation. To enumerate these functions and key commits, of which majority resides now in the file located at R/getpcasummaries.R:

getPCaSummaryTable : https://github.com/Syksy/curatedPCaData/commit/6d1aaaae356f11c1cdc73715aba9540e353ee2cc
getPCaSummarySurv : https://github.com/Syksy/curatedPCaData/commit/6d1aaaae356f11c1cdc73715aba9540e353ee2cc
getPCaSummarySamples : https://github.com/Syksy/curatedPCaData/commit/e4129376e1b6280bf14483baf908e51852bdfc43
getPCaSummaryStudies : https://github.com/Syksy/curatedPCaData/commit/b6bffd1ef4b3e669583576e9acc256d5e9934d84 and https://github.com/Syksy/curatedPCaData/commit/f8ef61ce7aec3e6da965afdbe65b5a5b46b19bea
getPCaStudies : https://github.com/Syksy/curatedPCaData/commit/71c73c6cc4b44401c730d9e7f268f8a9942fcf0c

With the above functions now in use by the vignette and exported by the user, the revision(s) addresses the above mentioned issues and clean up the vignette (Rmd commits mainly in https://github.com/Syksy/curatedPCaData/commit/2f1b02a34948400e99cf18c88b5053b3b6b81981 , https://github.com/Syksy/curatedPCaData/commit/90a37f3ba3609a4870da170c8fc375bcaf475a16 , https://github.com/Syksy/curatedPCaData/commit/7abab8cab289c3f31e587341874ae6121d4a248b )

[x] Consider using *.bib reference file rather than hard-coding references in the vignette.

OK; The citations have now been revised as requested in https://github.com/Syksy/curatedPCaData/commit/1940b500d809ac06046f120d280bb9f2de49731a and https://github.com/Syksy/curatedPCaData/commit/dff5dee69bc1506bb6485913bea998d43db1ad86 .

[x] Use single backticks to enclose code nouns, e.g. colData and MultiAssayExperiment

OK; These mentioned triple backticks have been corrected to single backticks in https://github.com/Syksy/curatedPCaData/commit/0b46706998a9b783c023dcff28e79fa10f3045ea .

R

[x] Use a more descriptive argument than slots in getPCa

OK; The name of the argument has now been changed to assays in order to be more in line with the naming conventions in MultiAssayExperiment via commit https://github.com/Syksy/curatedPCaData/commit/633d17eb510a5d9c1375b88901af409d20fe718f .

[x] Consider refactoring the logic in wrapperSortonco. It may be safer and more robust to apply a function recursively instead.

OK; This function has now been moved to be as part of the curatedPCaWorkflow instead, as it relates to oncoprints ( https://github.com/Syksy/curatedPCaData/commit/62be8a42449421dac6888d1d3c787242be8ea192 ).

tests

[x] Consider increasing the coverage:

covr::package_coverage() curatedPCaData Coverage: 24.14% R/wrappers.R: 0.00% R/getpca.R: 76.92%

After adjustments such as moving the wrappers focused on e.g. oncoprints out of the package ( https://github.com/Syksy/curatedPCaData/commit/62be8a42449421dac6888d1d3c787242be8ea192 ), adding new functions for replacing the ones previously defined inside vignette (e.g. https://github.com/Syksy/curatedPCaData/commit/6d1aaaae356f11c1cdc73715aba9540e353ee2cc , https://github.com/Syksy/curatedPCaData/commit/e4129376e1b6280bf14483baf908e51852bdfc43 , https://github.com/Syksy/curatedPCaData/commit/b6bffd1ef4b3e669583576e9acc256d5e9934d84 , https://github.com/Syksy/curatedPCaData/commit/71c73c6cc4b44401c730d9e7f268f8a9942fcf0c), and extending the testing coverage ( https://github.com/Syksy/curatedPCaData/commit/32352235fd9f0df8c04fd5d677ae2fb53c2b984c ), the output for this coverage testing is as follows:

covr::package_coverage() curatedPCaData Coverage: 89.60% R/getpca.R: 79.61% R/getpcasummaries.R: 94.87%

Thus, key functionality is tested, with the parts not covered related to marginal cases like download timeout.

ACSoupir commented 1 year ago

Will try and take a look today. Thanks for the thorough update. acs

ACSoupir commented 1 year ago

In analysis.Rmd, lines 126 and 135. Need to figure out the MatchedAssayExperiment class but appears that's where they are talking.

edit1: intersect results in 461 TCGA samples, MatchedAssayExperiment produces 400. Easy to adjust but lose samples.

ACSoupir commented 1 year ago

@varsha090597 If needed, can help with this as well. Mentioned helping rewrite the big nested if/for. Let me know.

Syksy commented 1 year ago

Thanks @ACSoupir for your help, highly appreciated; any chance you could work on this soon @varsha090597 as Alex is also willing to help out?

Syksy commented 1 year ago

In analysis.Rmd, lines 126 and 135. Need to figure out the MatchedAssayExperiment class but appears that's where they are talking.

edit1: intersect results in 461 TCGA samples, MatchedAssayExperiment produces 400. Easy to adjust but lose samples.

What are the sample types in these 461 and 400 samples? We'd like only primary samples in this case; is it possible that the 61 are normals and/or metastatic or other non-primary samples?

ACSoupir commented 1 year ago

Will have to see (lot on my plate right now).

From reading documentation it seems like the suggested function/class keeps samples that have data in all slots. It's more restrictive than the original intersect method, but don't know if intersect removed different samples that are now kept (i.e. intersect only kept primary, but there are primary and mets with complete data).

Syksy commented 1 year ago

Dear all, with all the check-boxes ticked with what my in opinion is a good revision, I'll be submitting this back to Bioconductor review.

The latest R CMD check is:

using log directory 'D:/Gits/curatedPCaData.Rcheck'

using R Under development (unstable) (2023-07-23 r84741 ucrt)

using platform: x86_64-w64-mingw32

R was compiled by gcc.exe (GCC) 12.2.0 GNU Fortran (GCC) 12.2.0

running under: Windows 10 x64 (build 19045)

using session charset: UTF-8

checking for file 'curatedPCaData/DESCRIPTION' ... OK

this is package 'curatedPCaData' version '0.99.2'

package encoding: UTF-8

checking package namespace information ... OK

checking package dependencies ... OK

checking if this is a source package ... OK

checking if there is a namespace ... OK

checking for executable files ... OK

checking for hidden files and directories ... OK

checking for portable file names ... OK

checking whether package 'curatedPCaData' can be installed ... OK

checking installed package size ... OK

checking package directory ... OK

checking 'build' directory ... OK

checking DESCRIPTION meta-information ... OK

checking top-level files ... OK

checking for left-over files ... OK

checking index information ... OK

checking package subdirectories ... OK

checking R files for non-ASCII characters ... OK

checking R files for syntax errors ... OK

checking whether the package can be loaded ... OK

checking whether the package can be loaded with stated dependencies ... OK

checking whether the package can be unloaded cleanly ... OK

checking whether the namespace can be loaded with stated dependencies ... OK

checking whether the namespace can be unloaded cleanly ... OK

checking loading without being on the library search path ... OK

checking dependencies in R code ... OK

checking S3 generic/method consistency ... OK

checking replacement functions ... OK

checking foreign function calls ... OK

checking R code for possible problems ... OK

checking Rd files ... OK

checking Rd metadata ... OK

checking Rd cross-references ... OK

checking for missing documentation entries ... OK

checking for code/documentation mismatches ... OK

checking Rd \usage sections ... OK

checking Rd contents ... OK

checking for unstated dependencies in examples ... OK

checking contents of 'data' directory ... OK

checking data for non-ASCII characters ... OK

checking data for ASCII and uncompressed saves ... OK

checking R/sysdata.rda ... OK

checking installed files from 'inst/doc' ... OK

checking files in 'vignettes' ... OK

checking examples ... OK

checking for unstated dependencies in 'tests' ... OK

checking tests ... Running 'native_tests.R' OK

checking for unstated dependencies in vignettes ... OK

checking package vignettes in 'inst/doc' ... OK

checking running R code from vignettes ... 'overview.Rmd' using 'UTF-8'... OK NONE

checking re-building of vignette outputs ... OK

checking PDF version of manual ... OK

DONE

Status: OK

The latest BiocCheck() is:

─ BiocCheckVersion: 1.37.7 ─ BiocVersion: 3.18 ─ Package: curatedPCaData ─ PackageVersion: 0.99.2 ─ sourceDir: D:/Gits/curatedPCaData ─ installDir: C:\Users\teemu\AppData\Local\Temp\Rtmp4OxmKI\file832452f04e5f ─ BiocCheckDir: D:/Gits/curatedPCaData.BiocCheck ─ platform: windows ─ isTarBall: FALSE

Installing package...

Checking for deprecated package usage...

Checking for remote package usage...

Checking for 'LazyData: true' usage...

Checking version number...

Checking version number validity... Package version 0.99.2; pre-release

Checking R version dependency...

Checking package size... Skipped... only checked on source tarball

Checking individual file sizes...

Checking biocViews...

Checking that biocViews are present...

Checking package type based on biocViews... ExperimentData

Checking for non-trivial biocViews...

Checking that biocViews come from the same category...

Checking biocViews validity...

Checking for recommended biocViews...

Checking build system compatibility...

Checking for blank lines in DESCRIPTION...

Checking if DESCRIPTION is well formatted...

Checking for proper Description: field...

Checking for whitespace in DESCRIPTION field names...

Checking that Package field matches directory/tarball name...

Checking for Version field...

Checking for valid maintainer...

Checking License: for restrictive use...

Checking for pinned package versions...

Checking DESCRIPTION/NAMESPACE consistency...

Checking .Rbuildignore...

Checking for stray BiocCheck output folders...

Checking for inst/doc folders...

Checking vignette directory...

Checking package installation calls in R code...

Checking for library/require of curatedPCaData...

Checking coding practice...

Checking parsed R code in R directory, examples, vignettes...

Checking function lengths...

NOTE: The recommended function length is 50 lines or less. There are 3 functions greater than 50 lines.

Checking man page documentation...

Checking package NEWS...

Checking unit tests...

Checking skip_on_bioc() in tests...

Checking formatting of DESCRIPTION, NAMESPACE, man pages, R source, and vignette source...

NOTE: Consider shorter lines; 12 lines (0%) are > 80 characters long.

NOTE: Consider multiples of 4 spaces for line indents; 11 lines (0%) are not. See https://contributions.bioconductor.org/r-code.html See styler package: https://cran.r-project.org/package=styler as described in the BiocCheck vignette.

Checking if package already exists in CRAN...

Checking for bioc-devel mailing list subscription...

NOTE: Cannot determine whether maintainer is subscribed to the Bioc-Devel mailing list (requires admin credentials). Subscribe here: https://stat.ethz.ch/mailman/listinfo/bioc-devel

Checking for support site registration... Maintainer is registered at support site. Package name is in support site watched tags.

─ BiocCheck results ── 0 ERRORS | 0 WARNINGS | 4 NOTES

The 4 NOTEs (3 really since one is just due to not having admin access) in my opinion ought to be tolerated; the shorter lines are primarily URLs that I was not able to cut shorter while maintaining integrity. The space indents are automatically generated for authors by roxygen, and uses the 2-spacing type. The long functions in this context could not be split into parts reasonably.

Thus, I will be submitting version 0.99.2.

Syksy commented 11 months ago

The Bioconductor review process is now complete (accepted), and we'll continue working on this package as well as focus on the curatedPCaWorkflow. See: https://github.com/Bioconductor/Contributions/issues/3047#issuecomment-1795760676 Thanks all! Daniel

Syksy / curatedPCaData

Bioconductor review #44

curatedPCaData https://github.com/Bioconductor/Contributions/issues/3047

DESCRIPTION

NAMESPACE

vignettes

R

tests