OHDSI / Hades

Health Analytics Data-to-Evidence Suite (HADES): A collection of R packages for performing analytics against the Common Data Model.
https://ohdsi.github.io/Hades
Apache License 2.0
23 stars 12 forks source link

HADES-wide dependency management and strategy #11

Open schuemie opened 2 years ago

schuemie commented 2 years ago

Dependencies are bad. They take a long time to install (e.g. by renv), and tend to be the cause when things break. We therefore would like to minimize our number of dependencies.

Some HADES packages have a lot of dependencies, and one reason is that some direct dependencies have a large list of dependencies themselves. If possible, different HADES packages should have overlapping dependencies, to minimize the overall dependencies of HADES. Therefore, one effort could focus on analyzing the list of dependencies per HADES package, and the overlap between dependencies.

If possible, we'd like to restrict our set of dependencies to those that we believe are reliable. Criteria could be defined for what are considered 'reliable' dependencies, based for example on age and development state of the package, and an assessment of the community behind the package. For example, most of tidyVerse is probably reliable. Once criteria have been established, we could form a list of 'approved' packages we can depend on.

ablack3 commented 2 years ago

This conversation might be better framed as a tradeoff between "build yourself" vs "borrow and share", a decision that Stephen Wali outlined in the open source workshop earlier this year. A significant benefit of taking on dependencies is that we save ourselves the work of implementing and maintaining some functionality. I like the idea trying to limit dependencies and analyzing our dependencies. If a package is only using one function from a large package it might make sense to implement single function and remove the dependency. We also have the option to add dependencies as SUGGESTS packages for optional functionality that might only be needed by a subset of users (e.g. some PLP libraries for example).

I'd like to suggest rOpenSci as a potential source for reliable packages we can depend on. I second that RStudio's tidyverse packages (as well as supporting packages like rlang) tend to be reliable and RStudio seems to do good job of communicating changes using lifecyles. This would be a great project for someone in Kheiron.

This talk from the 2020 RStudio conference outlines their strategy around lifecycle management (starting at 13 minutes in). https://www.rstudio.com/resources/rstudioconf-2020/state-of-the-tidyverse/

Criteria suggestions for discussion:

"Must" indicates a requirement. "Should" indicates a guideline.

schuemie commented 2 years ago

Thanks @ablack3 , points well taken.

I don't think we currently depend on any rOpenSci packages?

I would love for someone in Khieron to work on this. A first step would be an analysis of current dependencies. Then proposal for requirements and guidelines, followed by implementation of those.

gkovaig commented 2 years ago

@schuemie and @ablack3 I am part of the 2022 Khieron (or is it Kheiron?) cohort and would like to see if I have the chops to do this.

This can be broken down into two phases: (a) build a database of direct dependencies (b) scan for candidates which can be substituted based on frequency of usage across packages, availability of alternatives and effort involved.

If I were to start with https://ohdsi.github.io/Hades/packageStatuses.html, go through each github repo listed here:

  1. Fetch file NAMESPACE (automatically created by roxygen), parse it for the import statements. Cons: depends on appropriate use of roxygen syntax, and not all import statements may be captured; e.g. DQD repo does not have many entries in its NAMESPACE
  2. Look for renv.lock files. This would be a comprehensive and accurate source of dependencies. Cons: renv.lock files are not captured in the github repo (e.g. CohortDiagnostics excludes it with .gitignore)
  3. Scan all .R files for library() statements. Also scan for libraries invoked using Abc::def() syntax. Cons: lots of scanning

Your thoughts?

ablack3 commented 2 years ago

Awesome @gkovaig!

I like your idea of using the namespace file so we can track which functions are being imported.

I found some ideas on how to quickly list all package dependencies here.

Here is one idea for getting at the list of direct dependencies.

(Update 7/20: add drat repo to capture non-CRAN Hades packages)

library(dplyr)

# install.packages("drat")
# install all of the Hades packages - can take a few tries.
# devtools::install_github("OHDSI/Hades")

drat::addRepo("OHDSI") # so that available.packages includes Hades packages not on CRAN
hadesPackages <- Hades::listHadesPackages()[,"name"]

df <- available.packages() %>% 
  as_tibble() %>% 
  filter(Package %in% hadesPackages) %>% 
  select(Package, Depends, Imports, Suggests) %>% 
  tidyr::pivot_longer(cols = c(-Package), names_to = "type", values_to = "dependency") %>% 
  mutate(dependency = stringr::str_split(dependency, ",")) %>% 
  tidyr::unnest(c(dependency)) %>% 
  mutate(dependency = stringr::str_trim(dependency))

df
#> # A tibble: 415 × 3
#>    Package   type    dependency
#>    <chr>     <chr>   <chr>     
#>  1 Andromeda Depends dplyr     
#>  2 Andromeda Imports RSQLite   
#>  3 Andromeda Imports DBI       
#>  4 Andromeda Imports zip       
#>  5 Andromeda Imports methods   
#>  6 Andromeda Imports dbplyr    
#>  7 Andromeda Imports tidyselect
#>  8 Andromeda Imports cli       
#>  9 Andromeda Imports rlang     
#> 10 Andromeda Imports pillar    
#> # … with 405 more rows
#> # ℹ Use `print(n = ...)` to see more rows

# list Hades packages that are not included
setdiff(hadesPackages, df$Package)
#> [1] "EnsemblePatientLevelPrediction"

readr::write_csv(df, "hadesDependencies.csv")

df %>% 
  count(dependency, sort = T) %>% 
  print(n=20)
#> # A tibble: 166 × 2
#>    dependency       n
#>    <chr>        <int>
#>  1 testthat        22
#>  2 knitr           19
#>  3 rmarkdown       19
#>  4 rlang           16
#>  5 dplyr           13
#>  6 methods         12
#>  7 rJava            9
#>  8 Eunomia          8
#>  9 readr            8
#> 10 withr            8
#> 11 checkmate        7
#> 12 ggplot2          7
#> 13 R (>= 3.5.0)     7
#> 14 RJSONIO          6
#> 15 utils            6
#> 16 cli              5
#> 17 jsonlite         5
#> 18 shiny            5
#> 19 stringr          5
#> 20 survival         5
#> # … with 146 more rows
#> # ℹ Use `print(n = ...)` to see more rows

Created on 2022-07-20 by the reprex package (v2.0.1)

It would be nice to know which functions are being used from these and how many times they are used. renv::dependencies function will scan source code for dependencies so that might be helpful.

ablack3 commented 2 years ago

Law of dependent (software) origination: "All programs arise in dependence upon other programs" 😆

gkovaig commented 2 years ago

Thank you @ablack3 - that link to SO was super helpful! You already have a first-cut solution here!

Building on your idea, a core issue (that renv is trying to address) is that of versions of those dependencies, which in turn is tied to the version of R being used. I am really intrigued that we are not saving the renv.lock file as part of the github repo. Isn't that meant to capture the ground truth of R and package dependencies at the finest granularity? (What would still be missing is OS dependencies; that would have to be addressed using containers.)

Perhaps I have looked at samples which are the exception, rather than the rule.

@schuemie I recall seen some posts where you have passionately advocated for using renv. Can you clarify if (a) all HADES packages are expected to use renv (b) if the contents of renv.lock are meant to be version-controlled in a github repo.

ablack3 commented 2 years ago

renv is used to reproducibly run R analysis scripts/projects and is not intended to be used as a way to capture R package dependencies.

The renv package is a new effort to bring project-local R dependency management to your projects. https://rstudio.github.io/renv/articles/renv.html

For package dependencies we use the DESCRIPTION file which allows you specify all dependencies (and their exact or minimum versions if you need to).

A dependency is a code that your package needs to run. Dependencies are managed by two files. The DESCRIPTION manages dependencies at the package level; i.e. what packages needs to be installed for your package to work. R has a rich set of ways to describe different types of dependencies. A key point is whether a dependency is needed by regular users or is only needed for development tasks or optional functionality. https://r-pkgs.org/dependencies.html

We need renv for ohdsi-studies because we need to specify the exact environment that the study ran under so it can be reproduced. The main difference R packages and R projects is that packages are imported into other packages or projects so they need to have their own namespace and need to be maintained over time. R projects are what you might think of as a data analysis. They are not maintained and are not imported into other projects. They do not need a namespace. However the line can be blurred a bit and we can adopt aspects of R packages into our studies/projects if we want. For example we can add a tests directory to a project and use the testthat to add unit tests to a project. If you want to make your project into a full blown R package that is fine too but usually a data analysis does not need as much structure as a package because it only needs to run once. In OHDSI, since we are running a study across multiple data nodes, we have tended to use R packages as projects (which is why they have both a DESCRIPTION file and an renv.lock file). But these packages often just export a single execute function and are not meant to be imported into other packages or maintained after the study is completed. They really don't need their own namespace. So the line can be blurred but I think it is good to make a clear distinction between analysis code (user facing) and the internals of tools/packages (developer facing).

R packages (i.e. tools maintained by developers) contain functionality that is used/imported into a study or data analysis. Users care that the functions in your package work today (the day that they are using the function). That means as an R package author your job is to make sure your code works today with the current versions of all of its dependencies. This means you don't need to worry too much about updates to other packages that don't break your code. It really does not matter if one of your dependencies is updated as long as the interface (function input/output) you are using does not change. It also means that if a lot of packages depend on your code then you should be very careful about changing the interface and I think CRAN expects some effort to that effect. This is perhaps a different perspective than one taken in other languages like Java where it is common to specify exact versions for all dependencies.

A good analogy in the OHDSI world is the difference between a "concept set expression" and a "list of concept ids". A "concept set expression" resolves to a list of concepts based on the current vocabulary version. A list of concept IDs is immutable and will always represent the exact same concept set. The DESCRIPTION file lists direct dependencies that resolve to a complete (recursive) dependency list based on the current versions of all dependencies. The renv.lock file captures the complete (recursive) dependency list so that the whole R code stack stays exactly the same which is needed for reproducibility.

In the DESCRIPTION file you can specify an exact package version for your dependency but people usually don't do this because it is an unnecessary restriction. More often if a dependency introduced a breaking change we would just say the version has to be >= (greater than or equal to) the one that introduced the change.

The key thing I think we should remember is that we are depending on interfaces (function inputs/outputs) and only need to worry about breaking changes to those interfaces. Stability in our software is about stable interfaces.

Sorry for jumping on the soapbox a bit. I have a lot to say on this topic apparently 😃

ablack3 commented 2 years ago

@gkovaig Here are some resources that might be helpful.

RUniverse: technical, social, and scientific indicators of the health of open source R packages. https://www.rstudio.com/resources/rstudioglobal-2021/monitoring-health-and-impact-of-open-source-projects/ https://r-universe.dev/

CRAN downloads app https://hadley.shinyapps.io/cran-downloads/

As an example, Hades is using both jsonlite and RJSONIO but I suspect we might be able to pick one. We can see that jsonlite is much more heavily downloaded than RJSONIO.

image

Comparing the number of reverse dependencies and recursive reverse dependencies between RJSONIO and jsonlite shows us that many more packages rely on jsonlite than RJSONIO.

library(tools)
summary(package_dependencies(c("RJSONIO", "jsonlite"), reverse = T))
#>          Length Class  Mode     
#> RJSONIO    38   -none- character
#> jsonlite 1024   -none- character
summary(package_dependencies(c("RJSONIO", "jsonlite"), reverse = T, recursive = T))
#>          Length Class  Mode     
#> RJSONIO    63   -none- character
#> jsonlite 4803   -none- character

Created on 2022-07-20 by the reprex package (v2.0.1)

Looking at the source code of ROhdsiWebApi it appears that RJSONIO is used only to translate R objects to and from json which I think jsonlite provides with jsonlite::fromJSON(), jsonlite::toJSON().

image
schuemie commented 2 years ago

Great work @gkovaig ! I seem to remember there are subtle differences between RJSONIO and jsonlite which means we sometimes need one and sometimes the other. But I don't remember the details, and I think it is good to revisit the discussion, and if possible choose one.

Will you be joining the HADES meeting tomorrow? It would be good to let people know you're working on this, and maybe brainstorm some ideas.

gkovaig commented 2 years ago

Thank you @schuemie - appears the meeting is at 9am PST. Can join for the first 25 minutes.

schuemie commented 2 years ago

Making a note to also explore the pkgndep package as mentioned at our last HADES meeting.

gkovaig commented 2 years ago

Thank you @schuemie - chatted with Edward Burns right after our meeting and obtained this starter script.

library(pkgndep) pkg <- pkgndep(here::here()) plot(pkg, fix_size = FALSE) heaviness(pkg)

Based on the discussion at this meeting, task #1 is to obtain a comprehensive list of dependencies, with recursion. Optionally include reverse dependency counts.

Potential candidates are: base::available.packages() tools::package_dependencies() mvbutils::pkgndep()

Scope is all packages in OHDSI/Hades. Potential artifacts: (a) a data frame capturing the relationships (b) summaries

Task # 2 could be zeroing in on these potential candidates for consolidation. Obtain detailed usage in HADES and 'popularity' (via reverse dependencies) in other CRAN packages (based on pointers from @ablack3 ) read_csv -> readr devtools -> remotes RCurl -> curl rJSON JSONIO -> jsonlite dataframe -> tibble sqlite -> arrow (via Andromeda)

p.s. should we move these notes and discussion into Teams for easier editing, commenting, etc.?

ablack3 commented 2 years ago

should we move these notes and discussion into Teams for easier editing, commenting, etc.?

My vote would be to keep the progress updates and topic specific discussion in this thread and use Teams for 1:1 communication or for discussion with the broader OSC group or Hades group.

With regard to dependencies we should also include Java dependencies as well since several Hades package depend on Java packages. In at least one case, a Hades package depends on another Hades package through Java but not R. I think having careful strategy about the interaction between in Java and R components and how packages in one language depend on packages in the other would be quite helpful.

For example:

The DatabaseConnector and SqlRender repositories contain both R and Java packages combined together.

The SqlRender java package is imported into WebAPI and pulled in from repo.ohdsi.org I think. I think that credentials needed to see the packages in the OHDSI nexus repo right @leeevans? https://repo.ohdsi.org/nexus/ However WebAPI is not part of Hades so maybe out of scope for this discussion.

CirceR is part of Hades and contains compiled java code that is from the circe Java package that is in a separate repo called circe-be.

The circe java package also depends on SqlRender so I think it is fair to say that CirceR depends on SqlRender even though that dependency is not listed in the R package DESCRIPTION file.

Are all the jar files in CirceR built from the source code in circe-be? I'm not sure.

Each of the ohdsi java packages have their own java package dependencies.

schuemie commented 2 years ago

Although I agree mapping Java dependencies is a nice to have, I don't want to scope to creep too much. My original reason for starting this thread was to get a handle on our external dependencies to

  1. Increase stability of HADES by only depending on reliable third-party packages.
  2. Reduce the HADES install footprint by removing dependencies that can be easily replaced.

With regards to the second point, I think the biggest opportunities may not come from consolidating multiple packages that do the same thing, but rather by identifying dependencies that can very easily be removed. An example that comes to mind is the fact that DatabaseConnector used to depend on the uuid package. The only reason for that was one point in the code where random strings need to be generated. Generating random strings can easily be done with base R functions, so it was easy to drop this dependency that no other HADES package had. In this case it didn't have a big impact, since uuid doesn't depend on anything else, but I'm sure we can identify dependencies where removing them will greatly reduce the number of unique dependencies of HADES.

gkovaig commented 2 years ago

@schuemie @ablack3 I have forked OHDSI/Hades and created a Rmd notebook in the extras folder https://github.com/gkovaig/Hades/blob/main/extras/OHDSI_Hades_R_Package_Dependencies.Rmd.

Next step: find a way to parse DESCRIPTION of non-CRAN packages in Hades and append them to a dataframe of metadata from CRAN packages, then run through tools::package_dependencies() .

Appreciate if you can eye-ball this to see if I'm headed in the right direction and captured your intent correctly.

ablack3 commented 2 years ago

Hi @gkovaig, I think you're correct that the first step is to create a list of all Hades R package dependencies. I think the code example above that uses the OHDSI drat repository might capture all of them but you should double check.

The second step is to review the complete list of dependencies an identify pairs of packages that provide overlapping functionality (for example rjsonlite and RJSONIO).

It would be helpful to have some metrics on how heavily packages are used.

For each Hades dependency we would like to know

Read the pkgndep vignettes and explore the heaviness of Hades R package dependencies. Co-heaviness of two parent packages Check the Heaviness of Package Dependencies Suggestions for optimizing package dependencies

These are some ideas.

gkovaig commented 2 years ago

@ablack3 Thank you for the detailed notes and hints. I went ahead and extended the code to use pkgndep to examine heaviness and co-dependencies in a matrix format. I also added an example of RJSONIO usage.

Updates are in https://github.com/gkovaig/Hades/blob/main/extras/OHDSI_Hades_R_Package_Dependencies.Rmd. There are also two more artifacts in the same 'extras' folder, for the 'CSV' and the 'HTML' produced by knitr.

Some observations:

  1. Parent packages PatientLevelPrediction and EnsemblePatientLevelPrediction are the heaviest. Also, their co-heaviness larger than 10. The co-heaviness measures the number of additional dependencies that two parent packages simultaneously import and are only imported by the two parents. It would be useful to treat these two as outliers and focus on the other 22 packages first.

  2. Additionally, there are about 25 packages that are used only in EvidenceSynthesis and another 7 that are exclusive to ROhdsiWebApi. Dependencies specific to these two packages may need to be dealt with separately.

Specifically, I tried to use renv to get deeper into usage of specific functions of the dependent packages; apparently they are not tracked at the function-level. So we may have to use IDEs like pyCharm or VS-Code or other command-line parsers for this purpose.

Hope this is a good starting point. Looking forward to your comments.

schuemie commented 2 years ago

This is great! Could you share some more information on all HADES packages? (like heaviness per HADES package) I think pkgndep has some nice visualizations?

gkovaig commented 2 years ago

@schuemie I was expecting them to be embedded into the main 'knitted' html - apparently not 😟

I now added to heaviness visualizations, one for all-Hades, another for CirceR only in the same folder.

schuemie commented 2 years ago

Awesome! I've uploaded the all-HADES page here for now, for easy viewing.

The two PLP packages bring in a lot of dependencies unique to those two, which doesn't surprise me. What does surprise me is the heaviness of EvidenceSynthesis. If you look at its list of imports, it's not obvious where this comes from. Is there a way to analyze what causes this explosion of dependencies?

gkovaig commented 2 years ago

Thank you @schuemie . I added a viz https://github.com/gkovaig/Hades/blob/5f0045af687d3ab5d2c8a5b27dc57d7e3b7b0010/extras/OHDSI_Hades_R_Package_Dependencies.html for EvidenceSynthesis based on pkgdepR. This seems to add the function-level dependencies that @ablack3 requested earlier. The viz itself is interactive, so not sure if it renders in a useful way in the knitted HTML; in which case, you can re-run the last R code block in the Rmd file https://github.com/gkovaig/Hades/blob/5f0045af687d3ab5d2c8a5b27dc57d7e3b7b0010/extras/OHDSI_Hades_R_Package_Dependencies.Rmd.

schuemie commented 2 years ago

Thanks! Although I'm not sure how this visualization helps me understand where EvidenceSynthesis' heaviness comes from.

Some manual digging suggests most of these unique dependencies come through the meta package. This package is used exactly once, here. I could easily rewrite the code to drop the dependency on meta.

gkovaig commented 2 years ago

@schuemie You are right - the pkgdepR viz only brings up inter-dependencies between functions within EvidenceSynthesis, and not to specific calls to external packages like meta.

Came across this. So, with a stub like this: library(meta) library(NCmisc) NCmisc::list.functions.in.file(filename = '/Users/rama0/projects/EvidenceSynthesis/R/FixedEffectMetaAnalysis.R')

one could scan all R programs in a particular repo and collect all references to the external package (e.g. meta) and the specific functions within that external package. In this example, the only function called from meta is identified as metagen.

As a bonus, this could also help identify potential namespace conflicts.

Downside: the repo in question must be cloned in the local filesystem, all referenced external packages must be installed and loaded into memory before this can be run.

Would it be useful to dig further into this approach?

ablack3 commented 2 years ago

@gkovaig This package might be helpful for enumerating function use https://brshallo.github.io/funspotr/