msevestre commented 2 years ago

We have introduced some performant code using dplyr, purr, tidyr packages. But usages of said packages add complexity to the code, specifically for team member like me, who do not know how to use those packages.

we need to document (and it's ok to refer to the documentation if it is clear) the following methods and idiomatic R syntax (To be extended)

Methods

dplyr::mutate
dplyr::across
dplyr::matches
dplyr::bind_rows
purrrr::map (used in purrr::map(yDataList, ~ .yUnitConverter(.x, yTargetUnit)))
Syntax
what is ~ and.x in .fns = ~ tidyr::replace_na(.x, "missing") or purrr::map(yDataList, ~ .yUnitConverter(.x, yTargetUnit))

Why sometime use %>% and sometime assignment e.g.

data <- dplyr::arrange(data, .rowidInternal)

 data <- dplyr::select(data, -dplyr::matches("Split$|.rowidInternal"))

vs (might not be syntactically correct)

data  
%>% 
  dplyr::arrange(.rowidInternal)
%>% 
dplyr::select(-dplyr::matches("Split$|.rowidInternal"))

IndrajeetPatil commented 2 years ago

Why sometime use %>% and sometime assignment

There is no reason to prefer one way or another. They are identical, and what is preferred is just a matter of personal taste. That said, long piped chains can be difficult to debug.

IndrajeetPatil commented 2 years ago

what is ~ and .x in .fns = ~ tidyr::replace_na(.x, "missing") or purrr::map(yDataList, ~ .yUnitConverter(.x, yTargetUnit))

~ is lambda syntax to create anonymous functions.
.x is a pronoun that refers to data.

So, in the following:

purrr::keep(list_of_data_frames, ~ nrow(.x) > 0L)

.x pronoun refers to list_of_data_frames, while ~ nrow(.x) > 0L creates an anonymous function.

If you didn't want to use an anonymous function, you will have to create a named function:

checkEmptyDataFrame <- function(data) nrow(data) > 0L
purrr::keep(list_of_data_frames, checkEmptyDataFrame)

IndrajeetPatil commented 2 years ago

Methods

I don't see the point in documenting them when they are already comprehensively documented on their respective package websites: https://www.tidyverse.org/packages/

IndrajeetPatil commented 2 years ago

Btw, tidyverse is not the only 3rd part packages we are using. We are also using rClr.

If we are going to document dplyr:: methods, we should also be documenting all used instances of rClr:: methods because I don't understand what this code is doing, and its usage in the code is not always accompanied by comments.

This will also be helpful for future team members naive to rClr.

msevestre commented 2 years ago

because I don't understand what this code is doing,

Are you serious or just being sarcastic?

But let me do it in any case:

rClr::clrSet
rClr::clrGet
rClr::clrCallStatic
rClr::clrCall
rClr:: clrLoadAssembly
rClr::clrNew

I believe those are the 5 methods that we are calling. The name of the method and the way they are being called seem for me to be pretty clear but if not, let's clarify

rClr is a wrapper so it just delegates to the underlying .NET classes. This is what it does. This is the only thing that it does. So with that in mind, let's check those methods:

rClr::clrLoadAssembly(filePathFor("OSPSuite.R.dll")) loads an assembly in memory
rClr::clrGet calls a get property. of an object and returns the value rClr::clrGet(parameter, "Value") will call the get property Value of the parameter and return the value
rClr::clrSet calls the set property. of an object and returns the value rClr::clrSet(parameter, "Value". 5) will call the set property Value of the parameter and set the value 5
rClr::clrCall calls a method on an object. Extra parameters are all the parameter in order required by the API rClr::clrCall(sensitivityAnalysisTask, "ExportResultsToCSV", results, results$simulation, filePath) will call the method ExportResultsToCSV of the sensitivityAnalysisTask object and pass it 3 parameters (results, simulation and filePath)
rClr::clrNew instantiate a new class rClr::clrNew("PKSim.Core.Snapshots.Parameter") will create a new instance class of type PKSim.Core.Snapshots.Parameter

I do not think we need to document those lines in the code as they read exactly as what they do. No magic. Just plain code

Hopefully this clarifies what this wrapper is doing

msevestre commented 2 years ago

I don't see the point in documenting them

I disagree that's why I think we should make it our own simplified version Let's look at an example shall we?

https://dplyr.tidyverse.org/reference/arrange.html

Very helpful. Then I need to scroll for a while to see an example

Seems easy enough. ok

A bit further

Already no idea

The problem is this with the doc. It documents the how but not the WHY. It does not say what the code is doing.

Arrange is probably the easiest of those methods BTW. And we can agree to disagree. I am not asking you to do it. I will do it myself when I have time

IndrajeetPatil commented 2 years ago

Are you serious or just being sarcastic?

I wasn't being sarcastic. I truly found it difficult, and it's quite handy to have the reference you have posted here.

IndrajeetPatil commented 2 years ago

I will get rid of purrr's lambda syntax.

msevestre commented 2 years ago

No no. We just need to know what it means

msevestre commented 2 years ago

In fact, can we use this syntax instead of function (x){x+2} Or is this only when used as argument of other functions?

We (all but you) need to get better at using this thing

IndrajeetPatil commented 2 years ago

We (all but you) need to get better at using this thing

Maybe, but in esqLABS R-devel meeting I was informed to only use base-R as much as possible because most people don't know tidyverse and reviewing and maintaining this code can be a challenge in the future. So this might be a good opportunity for me to reduce tidyverse usage as much as possible. I can do it one PR at a time.

msevestre commented 2 years ago

Let's talk about this with @PavelBal next time. As far as I am concerned, the decision made for esqlabs have no bearing on what we do here. I personally think we should make sure our code is easy to understand and I don't necessarily believe that base R always achieve this.

IndrajeetPatil commented 2 years ago

Okay, I have two PRs now that demonstrate my internal conflict on what is the expected modus operandi for me:

892 - with tidyverse

876 - without tidyverse

We can discuss which one to merge, and that sets a precedent for a lot to come.

PavelBal commented 2 years ago

My opinion:

Use base R if it can solve the problem with the same complexity (sometimes I even think the code in base R looks less complex) as tidyverse.
If tidyverse allows to solve the same problem with less code (but same readability!) or higher performance - use tidyverse
what is the benefit of using ~ over \(x) ?
always use {} also in anonymous functions, so it is clear where the function starts and where it ends
Do not use pipes unless it is a very simple expression. Why? It is much (!!!) harder to debug and profile code that uses pipes compared to using assignments for intermediate steps. I tried to profile such code once and actually have no idea how to do. The same goes for debugging - if multiple steps are piped together, I don't know how to separate them to see where the bug is introduced.

PavelBal commented 2 years ago

Here I am already confused: why do we need the lambda function in purrr::keep(list_of_data_frames, ~ nrow(.x) > 0L) (well actually it does make sense for me), but not for dplyr::filter(data, x > 5)?

IndrajeetPatil commented 2 years ago

Gross simplication: purrr is used for working with lists; dplyr with data frames.

purrr::keep() is working with a list, and a lambda function is needed to apply it to each element of the list. Similar to:

.removeEmptyDataFrame <- function(x) Filter(function(data) nrow(data) > 0L, x)

You don't need to use lambda syntax, you can just use an anonymous function, but lambda is more idiomatic of a functional programming tool like purrr:

purrr::keep(list_of_data_frames, function(data) nrow(data) > 0L)

dplyr::filter() is using x > 5 condition, and not a function, to filter out parts of a data frame

dplyr::filter(data, x > 5)

IndrajeetPatil commented 2 years ago

what is the benefit of using ~ over (x) ?

\(x) syntax for anonymous functions was introduced only a year ago in R 4.1. So if you want to use, you will have to bump minimum R version for this package to 4.1. The lambda syntax works with all versions of R > 3.4 (at the minimum).

Plus, R users are intimately familiar with ~ via statistical models (e.g. lm(wt ~ mpg, mtcars)).

IndrajeetPatil commented 2 years ago

If you are looking for dplyr to base translation: https://dplyr.tidyverse.org/articles/base.html

IndrajeetPatil commented 2 years ago

always use {} also in anonymous functions, so it is clear where the function starts and where it ends

I disagree.

But, if this is the standard we want to adopt in OSP R code, then it should be part of our R coding standards.

Because both styler and tidyverse style guide find the non-braced syntax for anonymous function to be acceptable under their guidelines:

styler::style_text("(function(x, y) z <- x^2 + y^2)(0:7, 1)")
#> (function(x, y) z <- x^2 + y^2)(0:7, 1)

^{Created on 2022-04-25 by the reprex package (v2.0.1.9000)}

In fact, it'd be highly unusual to use {} for anonymous functions in tidyverse workflows if there is only a single statement. You can see this on display in every use of anonymous functions on their websites. E.g.

msevestre commented 2 years ago

Working with df, dplyr seems to be a clear winner.

lm(wt ~ mpg, mtcars)

Even with all my ~ knowledge (as of Friday :)) , I am not quite sure what this does. This is a lambda but .x is not used?

IndrajeetPatil commented 2 years ago

Even with all my ~ knowledge (as of Friday :)) , I am not quite sure what this does. This is a lambda but .x is not used?

No, no. I meant to say that the lambda syntax was implemented using ~ because R users were already familiar with it via statistical modeling.

lm(wt ~ mpg, mtcars) is read as "linearly regress wt variable on mpg variable".

IndrajeetPatil commented 2 years ago

Systematic comparison of tidy and base-R equivalent functions.

At least, for a reasonably sized data frame, the tidy solutions are faster and consume less amount of memory. Using or not using the pipe doesn't make much of a difference.

library(dplyr, warn.conflicts = FALSE)

set.seed(123)
x <- rnorm(1e6)
df <- data.frame(x = x)

# dplyr with pipe
df_tidy_pipe <- function(data) {
  data %>%
    mutate(x_sq = x^2) %>%
    filter(x > 0) %>%
    arrange(x_sq) %>%
    pull(x_sq)
}

# dplyr without pipe
df_tidy_nopipe <- function(data) {
  data <- mutate(data, x_sq = x^2)
  data <- filter(data, x > 0)
  data <- arrange(data, x_sq)
  data <- pull(data, x_sq)
  return(data)
}

# base-R
df_base <- function(data) {
  data <- transform(data, x_sq = x^2)
  data <- subset(data, x > 0)
  data <- data[order(data$x_sq), , drop = FALSE]
  data <- data[["x_sq"]]
  return(data)
}

bench::mark(
  "base" = df_base(df),
  "tidy - pipe" = df_tidy_pipe(df),
  "tidy - no pipe" = df_tidy_nopipe(df),
  check = TRUE
)[1:8]
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 3 × 6
#>   expression          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base             46.6ms   55.5ms      14.8    69.1MB     38.7
#> 2 tidy - pipe        24ms   28.1ms      29.9    44.4MB     43.0
#> 3 tidy - no pipe   24.7ms   26.6ms      33.1      42MB     42.8

^{Created on 2022-04-28 by the reprex package (v2.0.1)}

Session info

``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.2.0 (2022-04-22) #> os macOS Monterey 12.3 #> system aarch64, darwin20 #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Europe/Berlin #> date 2022-04-28 #> pandoc 2.18 @ /usr/local/bin/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.0) #> bench 1.1.2 2021-11-30 [1] CRAN (R 4.2.0) #> cli 3.3.0 2022-04-25 [1] CRAN (R 4.2.0) #> crayon 1.5.1 2022-03-26 [1] CRAN (R 4.2.0) #> DBI 1.1.2 2021-12-20 [1] CRAN (R 4.2.0) #> digest 0.6.29 2021-12-01 [1] CRAN (R 4.2.0) #> dplyr * 1.0.8 2022-02-08 [1] CRAN (R 4.2.0) #> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0) #> evaluate 0.15 2022-02-18 [1] CRAN (R 4.2.0) #> fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.0) #> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0) #> fs 1.5.2 2021-12-08 [1] CRAN (R 4.2.0) #> generics 0.1.2 2022-01-31 [1] CRAN (R 4.2.0) #> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0) #> highr 0.9 2021-04-16 [1] CRAN (R 4.2.0) #> htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.2.0) #> knitr 1.39 2022-04-26 [1] CRAN (R 4.2.0) #> lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.2.0) #> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0) #> pillar 1.7.0 2022-02-01 [1] CRAN (R 4.2.0) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0) #> profmem 0.6.0 2020-12-13 [1] CRAN (R 4.2.0) #> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.2.0) #> R.cache 0.15.0 2021-04-30 [1] CRAN (R 4.2.0) #> R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.2.0) #> R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.2.0) #> R.utils 2.11.0 2021-09-26 [1] CRAN (R 4.2.0) #> R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0) #> reprex 2.0.1 2021-08-05 [1] CRAN (R 4.2.0) #> rlang 1.0.2 2022-03-04 [1] CRAN (R 4.2.0) #> rmarkdown 2.14 2022-04-25 [1] CRAN (R 4.2.0) #> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.2.0) #> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0) #> stringi 1.7.6 2021-11-29 [1] CRAN (R 4.2.0) #> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.2.0) #> styler 1.7.0.9001 2022-04-23 [1] Github (r-lib/styler@81393ba) #> tibble 3.1.6 2021-11-07 [1] CRAN (R 4.2.0) #> tidyselect 1.1.2 2022-02-21 [1] CRAN (R 4.2.0) #> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.0) #> vctrs 0.4.1 2022-04-13 [1] CRAN (R 4.2.0) #> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0) #> xfun 0.30 2022-03-02 [1] CRAN (R 4.2.0) #> yaml 2.3.5 2022-02-21 [1] CRAN (R 4.2.0) #> #> [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library #> #> ────────────────────────────────────────────────────────────────────────────── ```

IndrajeetPatil commented 2 years ago

@msevestre To answer your question in the PR:

Whether a pipe is used or not will not change how many copies are created. So, if I ever said that, I was clearly in the wrong.

library(dplyr, warn.conflicts = FALSE)

# with pipe

df <- data.frame(x = c(1, 2))

tracemem(df)
#> [1] "<0x106169288>"
df %>%
  mutate(x_sq = x^2) %>%
  filter(x > 1) %>%
  pull(x)
#> tracemem[0x106169288 -> 0x130a4f378]: initialize <Anonymous> mutate_cols mutate.data.frame mutate filter pull %>% eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch 
#> tracemem[0x130a4f378 -> 0x130988bd0]: initialize <Anonymous> mutate_cols mutate.data.frame mutate filter pull %>% eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch 
#> tracemem[0x106169288 -> 0x1054460e0]: new_data_frame vec_data dplyr_vec_data as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate filter pull %>% eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch 
#> tracemem[0x1054460e0 -> 0x105445f20]: new_data_frame dplyr_vec_data as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate filter pull %>% eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch 
#> tracemem[0x105445f20 -> 0x105445e40]: as.list.data.frame as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate filter pull %>% eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch
#> [1] 2
untracemem(df)

# without pipe

df2 <- data.frame(x = c(1, 2))

tracemem(df2)
#> [1] "<0x105941f60>"
df2 <- mutate(df2, x_sq = x^2)
#> tracemem[0x105941f60 -> 0x10716f648]: initialize <Anonymous> mutate_cols mutate.data.frame mutate eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch 
#> tracemem[0x10716f648 -> 0x10716f568]: initialize <Anonymous> mutate_cols mutate.data.frame mutate eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch 
#> tracemem[0x105941f60 -> 0x1072b6ae0]: new_data_frame vec_data dplyr_vec_data as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch 
#> tracemem[0x1072b6ae0 -> 0x1072b68b0]: new_data_frame dplyr_vec_data as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch 
#> tracemem[0x1072b68b0 -> 0x1072b67d0]: as.list.data.frame as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch
df2 <- filter(df2, x > 1)
pull(df2, x)
#> [1] 2
untracemem(df2)

^{Created on 2022-04-28 by the reprex package (v2.0.1)}

For me, the benefit of pipe is mostly from a code readability perspective.

I find this:

data %>%
  f() %>%
  g() %>%
  h()

to be in line with my internal monologue

Take data and then (%>%) do f() and then do g() etc.

But, it's fine not to use it in the package development context where the team might not be familiar with it and where it might impose challenges for debugging.

Open-Systems-Pharmacology / OSPSuite-R

Create wiki explaining some of the 3rd party methods that we are using #890

Methods

Syntax

892 - with tidyverse

876 - without tidyverse