Syksy / curatedPCaData

Bioconductor R-package: Curated Prostate Cancer Data
Creative Commons Attribution 4.0 International
9 stars 4 forks source link

Bioconductor review #44

Closed Syksy closed 11 months ago

Syksy commented 1 year ago

Dear all, we've received Bioconductor review requests at: https://github.com/Bioconductor/Contributions/issues/3047#issuecomment-1591516244

Breaking it down (and to be updated in respect to what's been addressed):

curatedPCaData https://github.com/Bioconductor/Contributions/issues/3047

OK; As discussed in the submission thread for BioConductor at https://github.com/Bioconductor/Contributions/issues/3047#issuecomment-1625486043 , these downstream vignettes are to be moved to a new workflow package called curatedPCaWorkflow at: https://github.com/Syksy/curatedPCaWorkflow . Work will continue there-in regarding the proposed improvements. Please see answers below for specific actions taken toward this package-wide adjustment.

Multiple steps have been taken to address this and are further elaborated in answers below; briefly:

DESCRIPTION

OK

OK; Addressed in https://github.com/Syksy/curatedPCaData/commit/7ba59f5fb1bd917a0a0fc48e5245aad34cfec51b ; in addition, the package dependencies have been made more lean as the functions at /R/wrappers.R are no longer present, removing the import need for e.g. dplyr and stringr.

OK; CC-BY 4.0 should still be fine based on discussion at https://github.com/Bioconductor/Contributions/issues/3047#issuecomment-1625486043

NAMESPACE

OK; This has now been adjusted to using just native R tests in: https://github.com/Syksy/curatedPCaData/commit/4c1a481a82060d371da958c23980853ac0080ed5 and the tests themselves were subsequently extended to cover the new functions at https://github.com/Syksy/curatedPCaData/commit/32352235fd9f0df8c04fd5d677ae2fb53c2b984c and https://github.com/Syksy/curatedPCaData/commit/ed9d0b8b2b0fd1ab2c8fd7647224fa05a062caa8 .

vignettes

OK; this formatting issue has been addressed in https://github.com/Syksy/curatedPCaData/commit/bb76b0121fdd8fcd7e949c583160eb36d1c04db4 . Furthermore, the rest of the code has also been double-checked for formatting (multiples of 4 spacebar indents, no tabs, char width 80), of which vast majority are fixed in e.g. https://github.com/Syksy/curatedPCaData/commit/019a304e9dffaf0b92046a308486dd80b6b5eec8 . The 12 lines (~0% of all code) longer than 80 chars and 11 lines (~0% of all code) that violate 4*spacebar multiples are special cases such as long URLs or automatically generated package Rd TeX-like code, respectively.

OK; Notable effort has been put into revising the overview vignette so it exemplifies this effectively with generalizable code, while rest of the vignettes that had an analysis focus have been now shifted to the separate curatedPCaWorkflow-package. See comments below.

OK; Done in https://github.com/Syksy/curatedPCaData/commit/ee1c94bd8e09d9e6d072e5c5f27dd1655bcbc872 by moving to use curatedPCaWorkflow instead.

OK; This pointer has been addressed in the PR by @ACSoupir at https://github.com/Syksy/curatedPCaData/pull/45 . However, as the analyses vignette has now been incorporated into the workflow package instead, this fix was not merged into curatedPCaData. This fix will instead be incorporated into the curatedPCaWorkflow.

OK; After the other fixes both into the overview vignette as well as the generalized functions provided now in /R/getpcasummaries.R, all calls of type mae_obj[,"variable"] are now instead formatted as mae_obj$variable.

OK, at least partially; this revision request was not implemented as suggested here per se. I explored both the TCGAutils (and the curatedTCGAdata) for this purpose. However, I ended up with the conclusion that this would've been best implemented as part of the data curation step in curatedPCaData, rather than implemented here when we're just offering the "window" to the data. In all of our datasets, the sample_type field in colData was extracted from the original source and represents the sample types (primary, normal, metastatic, ...). In this respect, I feel it would be a bit much to bring in a dependency just for TCGA, when this information is already available in the colData. However, to address the original purpose of subsetting to certain sample types, I have now modified the main getPCa function to allow subsetting the creation of the MAE object via MultiAssayExperiment::subsetByColData which will utilize the sample_type available in our data : https://github.com/Syksy/curatedPCaData/commit/ba50efa4e16d266593589403709ae6a4ffa9f9cf . I hope this adjustment fulfills the requested improvement to the package - if not, please let me know and I will revise accordingly to the best of my ability.

OK; The oncoprint related functions and wrappers have been now moved to the new curatedPCaWorkflow package (e.g. commit https://github.com/Syksy/curatedPCaData/commit/62be8a42449421dac6888d1d3c787242be8ea192 ). In hindsight, they are somewhat downstream analyses already, and therefore are better suited for this workflow-package. Thus, the oncoprint and their related functions are no longer part of the curatedPCaData-package.

OK; This is a fair point, and indeed should've been the design already in the start. Notable effort has now been done to move the functionality previously coded inside the overview vignette, and made exported by the package itself accompanied by suitable examples, testing, and documentation. To enumerate these functions and key commits, of which majority resides now in the file located at R/getpcasummaries.R:

With the above functions now in use by the vignette and exported by the user, the revision(s) addresses the above mentioned issues and clean up the vignette (Rmd commits mainly in https://github.com/Syksy/curatedPCaData/commit/2f1b02a34948400e99cf18c88b5053b3b6b81981 , https://github.com/Syksy/curatedPCaData/commit/90a37f3ba3609a4870da170c8fc375bcaf475a16 , https://github.com/Syksy/curatedPCaData/commit/7abab8cab289c3f31e587341874ae6121d4a248b )

OK; The citations have now been revised as requested in https://github.com/Syksy/curatedPCaData/commit/1940b500d809ac06046f120d280bb9f2de49731a and https://github.com/Syksy/curatedPCaData/commit/dff5dee69bc1506bb6485913bea998d43db1ad86 .

OK; These mentioned triple backticks have been corrected to single backticks in https://github.com/Syksy/curatedPCaData/commit/0b46706998a9b783c023dcff28e79fa10f3045ea .

R

OK; The name of the argument has now been changed to assays in order to be more in line with the naming conventions in MultiAssayExperiment via commit https://github.com/Syksy/curatedPCaData/commit/633d17eb510a5d9c1375b88901af409d20fe718f .

OK; This function has now been moved to be as part of the curatedPCaWorkflow instead, as it relates to oncoprints ( https://github.com/Syksy/curatedPCaData/commit/62be8a42449421dac6888d1d3c787242be8ea192 ).

tests

After adjustments such as moving the wrappers focused on e.g. oncoprints out of the package ( https://github.com/Syksy/curatedPCaData/commit/62be8a42449421dac6888d1d3c787242be8ea192 ), adding new functions for replacing the ones previously defined inside vignette (e.g. https://github.com/Syksy/curatedPCaData/commit/6d1aaaae356f11c1cdc73715aba9540e353ee2cc , https://github.com/Syksy/curatedPCaData/commit/e4129376e1b6280bf14483baf908e51852bdfc43 , https://github.com/Syksy/curatedPCaData/commit/b6bffd1ef4b3e669583576e9acc256d5e9934d84 , https://github.com/Syksy/curatedPCaData/commit/71c73c6cc4b44401c730d9e7f268f8a9942fcf0c), and extending the testing coverage ( https://github.com/Syksy/curatedPCaData/commit/32352235fd9f0df8c04fd5d677ae2fb53c2b984c ), the output for this coverage testing is as follows:

covr::package_coverage() curatedPCaData Coverage: 89.60% R/getpca.R: 79.61% R/getpcasummaries.R: 94.87%

Thus, key functionality is tested, with the parts not covered related to marginal cases like download timeout.

ACSoupir commented 1 year ago

Will try and take a look today. Thanks for the thorough update. acs

ACSoupir commented 1 year ago

In analysis.Rmd, lines 126 and 135. Need to figure out the MatchedAssayExperiment class but appears that's where they are talking.

edit1: intersect results in 461 TCGA samples, MatchedAssayExperiment produces 400. Easy to adjust but lose samples.

ACSoupir commented 1 year ago

@varsha090597 If needed, can help with this as well. Mentioned helping rewrite the big nested if/for. Let me know.

Syksy commented 1 year ago

Thanks @ACSoupir for your help, highly appreciated; any chance you could work on this soon @varsha090597 as Alex is also willing to help out?

Syksy commented 1 year ago

In analysis.Rmd, lines 126 and 135. Need to figure out the MatchedAssayExperiment class but appears that's where they are talking.

edit1: intersect results in 461 TCGA samples, MatchedAssayExperiment produces 400. Easy to adjust but lose samples.

What are the sample types in these 461 and 400 samples? We'd like only primary samples in this case; is it possible that the 61 are normals and/or metastatic or other non-primary samples?

ACSoupir commented 1 year ago

Will have to see (lot on my plate right now).

From reading documentation it seems like the suggested function/class keeps samples that have data in all slots. It's more restrictive than the original intersect method, but don't know if intersect removed different samples that are now kept (i.e. intersect only kept primary, but there are primary and mets with complete data).

Syksy commented 1 year ago

Dear all, with all the check-boxes ticked with what my in opinion is a good revision, I'll be submitting this back to Bioconductor review.

The latest R CMD check is:

  • using log directory 'D:/Gits/curatedPCaData.Rcheck'
  • using R Under development (unstable) (2023-07-23 r84741 ucrt)
  • using platform: x86_64-w64-mingw32
  • R was compiled by gcc.exe (GCC) 12.2.0 GNU Fortran (GCC) 12.2.0
  • running under: Windows 10 x64 (build 19045)
  • using session charset: UTF-8
  • checking for file 'curatedPCaData/DESCRIPTION' ... OK
  • this is package 'curatedPCaData' version '0.99.2'
  • package encoding: UTF-8
  • checking package namespace information ... OK
  • checking package dependencies ... OK
  • checking if this is a source package ... OK
  • checking if there is a namespace ... OK
  • checking for executable files ... OK
  • checking for hidden files and directories ... OK
  • checking for portable file names ... OK
  • checking whether package 'curatedPCaData' can be installed ... OK
  • checking installed package size ... OK
  • checking package directory ... OK
  • checking 'build' directory ... OK
  • checking DESCRIPTION meta-information ... OK
  • checking top-level files ... OK
  • checking for left-over files ... OK
  • checking index information ... OK
  • checking package subdirectories ... OK
  • checking R files for non-ASCII characters ... OK
  • checking R files for syntax errors ... OK
  • checking whether the package can be loaded ... OK
  • checking whether the package can be loaded with stated dependencies ... OK
  • checking whether the package can be unloaded cleanly ... OK
  • checking whether the namespace can be loaded with stated dependencies ... OK
  • checking whether the namespace can be unloaded cleanly ... OK
  • checking loading without being on the library search path ... OK
  • checking dependencies in R code ... OK
  • checking S3 generic/method consistency ... OK
  • checking replacement functions ... OK
  • checking foreign function calls ... OK
  • checking R code for possible problems ... OK
  • checking Rd files ... OK
  • checking Rd metadata ... OK
  • checking Rd cross-references ... OK
  • checking for missing documentation entries ... OK
  • checking for code/documentation mismatches ... OK
  • checking Rd \usage sections ... OK
  • checking Rd contents ... OK
  • checking for unstated dependencies in examples ... OK
  • checking contents of 'data' directory ... OK
  • checking data for non-ASCII characters ... OK
  • checking data for ASCII and uncompressed saves ... OK
  • checking R/sysdata.rda ... OK
  • checking installed files from 'inst/doc' ... OK
  • checking files in 'vignettes' ... OK
  • checking examples ... OK
  • checking for unstated dependencies in 'tests' ... OK
  • checking tests ... Running 'native_tests.R' OK
  • checking for unstated dependencies in vignettes ... OK
  • checking package vignettes in 'inst/doc' ... OK
  • checking running R code from vignettes ... 'overview.Rmd' using 'UTF-8'... OK NONE
  • checking re-building of vignette outputs ... OK
  • checking PDF version of manual ... OK
  • DONE

Status: OK

The latest BiocCheck() is:

─ BiocCheckVersion: 1.37.7 ─ BiocVersion: 3.18 ─ Package: curatedPCaData ─ PackageVersion: 0.99.2 ─ sourceDir: D:/Gits/curatedPCaData ─ installDir: C:\Users\teemu\AppData\Local\Temp\Rtmp4OxmKI\file832452f04e5f ─ BiocCheckDir: D:/Gits/curatedPCaData.BiocCheck ─ platform: windows ─ isTarBall: FALSE

  • Installing package...
  • Checking for deprecated package usage...
  • Checking for remote package usage...
  • Checking for 'LazyData: true' usage...
  • Checking version number...
  • Checking version number validity... Package version 0.99.2; pre-release
  • Checking R version dependency...
  • Checking package size... Skipped... only checked on source tarball
  • Checking individual file sizes...
  • Checking biocViews...
  • Checking that biocViews are present...
  • Checking package type based on biocViews... ExperimentData
  • Checking for non-trivial biocViews...
  • Checking that biocViews come from the same category...
  • Checking biocViews validity...
  • Checking for recommended biocViews...
  • Checking build system compatibility...
  • Checking for blank lines in DESCRIPTION...
  • Checking if DESCRIPTION is well formatted...
  • Checking for proper Description: field...
  • Checking for whitespace in DESCRIPTION field names...
  • Checking that Package field matches directory/tarball name...
  • Checking for Version field...
  • Checking for valid maintainer...
  • Checking License: for restrictive use...
  • Checking for pinned package versions...
  • Checking DESCRIPTION/NAMESPACE consistency...
  • Checking .Rbuildignore...
  • Checking for stray BiocCheck output folders...
  • Checking for inst/doc folders...
  • Checking vignette directory...
  • Checking package installation calls in R code...
  • Checking for library/require of curatedPCaData...
  • Checking coding practice...
  • Checking parsed R code in R directory, examples, vignettes...
  • Checking function lengths...
    • NOTE: The recommended function length is 50 lines or less. There are 3 functions greater than 50 lines.
  • Checking man page documentation...
  • Checking package NEWS...
  • Checking unit tests...
  • Checking skip_on_bioc() in tests...
  • Checking formatting of DESCRIPTION, NAMESPACE, man pages, R source, and vignette source...
  • Checking if package already exists in CRAN...
  • Checking for bioc-devel mailing list subscription...
  • Checking for support site registration... Maintainer is registered at support site. Package name is in support site watched tags.

─ BiocCheck results ── 0 ERRORS | 0 WARNINGS | 4 NOTES

The 4 NOTEs (3 really since one is just due to not having admin access) in my opinion ought to be tolerated; the shorter lines are primarily URLs that I was not able to cut shorter while maintaining integrity. The space indents are automatically generated for authors by roxygen, and uses the 2-spacing type. The long functions in this context could not be split into parts reasonably.

Thus, I will be submitting version 0.99.2.

Syksy commented 11 months ago

The Bioconductor review process is now complete (accepted), and we'll continue working on this package as well as focus on the curatedPCaWorkflow. See: https://github.com/Bioconductor/Contributions/issues/3047#issuecomment-1795760676 Thanks all! Daniel