Open-Systems-Pharmacology / OSPSuite-R

R package for the OSPSuite
https://www.open-systems-pharmacology.org/OSPSuite-R/
Other
29 stars 12 forks source link

Create wiki explaining some of the 3rd party methods that we are using #890

Open msevestre opened 2 years ago

msevestre commented 2 years ago

We have introduced some performant code using dplyr, purr, tidyr packages. But usages of said packages add complexity to the code, specifically for team member like me, who do not know how to use those packages.

we need to document (and it's ok to refer to the documentation if it is clear) the following methods and idiomatic R syntax (To be extended)

Methods

Why sometime use %>% and sometime assignment e.g.

data <- dplyr::arrange(data, .rowidInternal)

 data <- dplyr::select(data, -dplyr::matches("Split$|.rowidInternal"))

vs (might not be syntactically correct)

data  
%>% 
  dplyr::arrange(.rowidInternal)
%>% 
dplyr::select(-dplyr::matches("Split$|.rowidInternal"))
IndrajeetPatil commented 2 years ago

Why sometime use %>% and sometime assignment

There is no reason to prefer one way or another. They are identical, and what is preferred is just a matter of personal taste. That said, long piped chains can be difficult to debug.

IndrajeetPatil commented 2 years ago

what is ~ and .x in .fns = ~ tidyr::replace_na(.x, "missing") or purrr::map(yDataList, ~ .yUnitConverter(.x, yTargetUnit))

So, in the following:

purrr::keep(list_of_data_frames, ~ nrow(.x) > 0L)

.x pronoun refers to list_of_data_frames, while ~ nrow(.x) > 0L creates an anonymous function.

If you didn't want to use an anonymous function, you will have to create a named function:

checkEmptyDataFrame <- function(data) nrow(data) > 0L
purrr::keep(list_of_data_frames, checkEmptyDataFrame)
IndrajeetPatil commented 2 years ago

Methods

I don't see the point in documenting them when they are already comprehensively documented on their respective package websites: https://www.tidyverse.org/packages/

IndrajeetPatil commented 2 years ago

Btw, tidyverse is not the only 3rd part packages we are using. We are also using rClr.

If we are going to document dplyr:: methods, we should also be documenting all used instances of rClr:: methods because I don't understand what this code is doing, and its usage in the code is not always accompanied by comments.

This will also be helpful for future team members naive to rClr.

msevestre commented 2 years ago

because I don't understand what this code is doing,

Are you serious or just being sarcastic?

But let me do it in any case:

rClr::clrSet
rClr::clrGet
rClr::clrCallStatic
rClr::clrCall
rClr:: clrLoadAssembly
rClr::clrNew

I believe those are the 5 methods that we are calling. The name of the method and the way they are being called seem for me to be pretty clear but if not, let's clarify

rClr is a wrapper so it just delegates to the underlying .NET classes. This is what it does. This is the only thing that it does. So with that in mind, let's check those methods:

I do not think we need to document those lines in the code as they read exactly as what they do. No magic. Just plain code

Hopefully this clarifies what this wrapper is doing

msevestre commented 2 years ago

I don't see the point in documenting them

I disagree that's why I think we should make it our own simplified version Let's look at an example shall we?

https://dplyr.tidyverse.org/reference/arrange.html image

Very helpful. Then I need to scroll for a while to see an example

image

Seems easy enough. ok

A bit further

image

Already no idea

The problem is this with the doc. It documents the how but not the WHY. It does not say what the code is doing.

Arrange is probably the easiest of those methods BTW. And we can agree to disagree. I am not asking you to do it. I will do it myself when I have time

IndrajeetPatil commented 2 years ago

Are you serious or just being sarcastic?

I wasn't being sarcastic. I truly found it difficult, and it's quite handy to have the reference you have posted here.

IndrajeetPatil commented 2 years ago

I will get rid of purrr's lambda syntax.

msevestre commented 2 years ago

No no. We just need to know what it means

msevestre commented 2 years ago

In fact, can we use this syntax instead of function (x){x+2} Or is this only when used as argument of other functions?

We (all but you) need to get better at using this thing

IndrajeetPatil commented 2 years ago

We (all but you) need to get better at using this thing

Maybe, but in esqLABS R-devel meeting I was informed to only use base-R as much as possible because most people don't know tidyverse and reviewing and maintaining this code can be a challenge in the future. So this might be a good opportunity for me to reduce tidyverse usage as much as possible. I can do it one PR at a time.

msevestre commented 2 years ago

Let's talk about this with @PavelBal next time. As far as I am concerned, the decision made for esqlabs have no bearing on what we do here. I personally think we should make sure our code is easy to understand and I don't necessarily believe that base R always achieve this.

IndrajeetPatil commented 2 years ago

Okay, I have two PRs now that demonstrate my internal conflict on what is the expected modus operandi for me:

892 - with tidyverse

876 - without tidyverse

We can discuss which one to merge, and that sets a precedent for a lot to come.

PavelBal commented 2 years ago

My opinion:

PavelBal commented 2 years ago

Here I am already confused: why do we need the lambda function in purrr::keep(list_of_data_frames, ~ nrow(.x) > 0L) (well actually it does make sense for me), but not for dplyr::filter(data, x > 5)?

IndrajeetPatil commented 2 years ago

Gross simplication: purrr is used for working with lists; dplyr with data frames.

.removeEmptyDataFrame <- function(x) Filter(function(data) nrow(data) > 0L, x)

You don't need to use lambda syntax, you can just use an anonymous function, but lambda is more idiomatic of a functional programming tool like purrr:

purrr::keep(list_of_data_frames, function(data) nrow(data) > 0L)
dplyr::filter(data, x > 5)
IndrajeetPatil commented 2 years ago

what is the benefit of using ~ over (x) ?

\(x) syntax for anonymous functions was introduced only a year ago in R 4.1. So if you want to use, you will have to bump minimum R version for this package to 4.1. The lambda syntax works with all versions of R > 3.4 (at the minimum).

Plus, R users are intimately familiar with ~ via statistical models (e.g. lm(wt ~ mpg, mtcars)).

IndrajeetPatil commented 2 years ago

If you are looking for dplyr to base translation: https://dplyr.tidyverse.org/articles/base.html

image

IndrajeetPatil commented 2 years ago

always use {} also in anonymous functions, so it is clear where the function starts and where it ends

I disagree.

But, if this is the standard we want to adopt in OSP R code, then it should be part of our R coding standards.

Because both styler and tidyverse style guide find the non-braced syntax for anonymous function to be acceptable under their guidelines:

styler::style_text("(function(x, y) z <- x^2 + y^2)(0:7, 1)")
#> (function(x, y) z <- x^2 + y^2)(0:7, 1)

Created on 2022-04-25 by the reprex package (v2.0.1.9000)

In fact, it'd be highly unusual to use {} for anonymous functions in tidyverse workflows if there is only a single statement. You can see this on display in every use of anonymous functions on their websites. E.g.

image

msevestre commented 2 years ago

Working with df, dplyr seems to be a clear winner.

lm(wt ~ mpg, mtcars)

Even with all my ~ knowledge (as of Friday :)) , I am not quite sure what this does. This is a lambda but .x is not used?

IndrajeetPatil commented 2 years ago

Even with all my ~ knowledge (as of Friday :)) , I am not quite sure what this does. This is a lambda but .x is not used?

No, no. I meant to say that the lambda syntax was implemented using ~ because R users were already familiar with it via statistical modeling.

lm(wt ~ mpg, mtcars) is read as "linearly regress wt variable on mpg variable".

IndrajeetPatil commented 2 years ago

Systematic comparison of tidy and base-R equivalent functions.

At least, for a reasonably sized data frame, the tidy solutions are faster and consume less amount of memory. Using or not using the pipe doesn't make much of a difference.

library(dplyr, warn.conflicts = FALSE)

set.seed(123)
x <- rnorm(1e6)
df <- data.frame(x = x)

# dplyr with pipe
df_tidy_pipe <- function(data) {
  data %>%
    mutate(x_sq = x^2) %>%
    filter(x > 0) %>%
    arrange(x_sq) %>%
    pull(x_sq)
}

# dplyr without pipe
df_tidy_nopipe <- function(data) {
  data <- mutate(data, x_sq = x^2)
  data <- filter(data, x > 0)
  data <- arrange(data, x_sq)
  data <- pull(data, x_sq)
  return(data)
}

# base-R
df_base <- function(data) {
  data <- transform(data, x_sq = x^2)
  data <- subset(data, x > 0)
  data <- data[order(data$x_sq), , drop = FALSE]
  data <- data[["x_sq"]]
  return(data)
}

bench::mark(
  "base" = df_base(df),
  "tidy - pipe" = df_tidy_pipe(df),
  "tidy - no pipe" = df_tidy_nopipe(df),
  check = TRUE
)[1:8]
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 3 × 6
#>   expression          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base             46.6ms   55.5ms      14.8    69.1MB     38.7
#> 2 tidy - pipe        24ms   28.1ms      29.9    44.4MB     43.0
#> 3 tidy - no pipe   24.7ms   26.6ms      33.1      42MB     42.8

Created on 2022-04-28 by the reprex package (v2.0.1)

Session info ``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.2.0 (2022-04-22) #> os macOS Monterey 12.3 #> system aarch64, darwin20 #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Europe/Berlin #> date 2022-04-28 #> pandoc 2.18 @ /usr/local/bin/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.0) #> bench 1.1.2 2021-11-30 [1] CRAN (R 4.2.0) #> cli 3.3.0 2022-04-25 [1] CRAN (R 4.2.0) #> crayon 1.5.1 2022-03-26 [1] CRAN (R 4.2.0) #> DBI 1.1.2 2021-12-20 [1] CRAN (R 4.2.0) #> digest 0.6.29 2021-12-01 [1] CRAN (R 4.2.0) #> dplyr * 1.0.8 2022-02-08 [1] CRAN (R 4.2.0) #> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0) #> evaluate 0.15 2022-02-18 [1] CRAN (R 4.2.0) #> fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.0) #> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0) #> fs 1.5.2 2021-12-08 [1] CRAN (R 4.2.0) #> generics 0.1.2 2022-01-31 [1] CRAN (R 4.2.0) #> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0) #> highr 0.9 2021-04-16 [1] CRAN (R 4.2.0) #> htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.2.0) #> knitr 1.39 2022-04-26 [1] CRAN (R 4.2.0) #> lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.2.0) #> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0) #> pillar 1.7.0 2022-02-01 [1] CRAN (R 4.2.0) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0) #> profmem 0.6.0 2020-12-13 [1] CRAN (R 4.2.0) #> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.2.0) #> R.cache 0.15.0 2021-04-30 [1] CRAN (R 4.2.0) #> R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.2.0) #> R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.2.0) #> R.utils 2.11.0 2021-09-26 [1] CRAN (R 4.2.0) #> R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0) #> reprex 2.0.1 2021-08-05 [1] CRAN (R 4.2.0) #> rlang 1.0.2 2022-03-04 [1] CRAN (R 4.2.0) #> rmarkdown 2.14 2022-04-25 [1] CRAN (R 4.2.0) #> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.2.0) #> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0) #> stringi 1.7.6 2021-11-29 [1] CRAN (R 4.2.0) #> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.2.0) #> styler 1.7.0.9001 2022-04-23 [1] Github (r-lib/styler@81393ba) #> tibble 3.1.6 2021-11-07 [1] CRAN (R 4.2.0) #> tidyselect 1.1.2 2022-02-21 [1] CRAN (R 4.2.0) #> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.0) #> vctrs 0.4.1 2022-04-13 [1] CRAN (R 4.2.0) #> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0) #> xfun 0.30 2022-03-02 [1] CRAN (R 4.2.0) #> yaml 2.3.5 2022-02-21 [1] CRAN (R 4.2.0) #> #> [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library #> #> ────────────────────────────────────────────────────────────────────────────── ```
IndrajeetPatil commented 2 years ago

@msevestre To answer your question in the PR:

Whether a pipe is used or not will not change how many copies are created. So, if I ever said that, I was clearly in the wrong.

library(dplyr, warn.conflicts = FALSE)

# with pipe

df <- data.frame(x = c(1, 2))

tracemem(df)
#> [1] "<0x106169288>"
df %>%
  mutate(x_sq = x^2) %>%
  filter(x > 1) %>%
  pull(x)
#> tracemem[0x106169288 -> 0x130a4f378]: initialize <Anonymous> mutate_cols mutate.data.frame mutate filter pull %>% eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch 
#> tracemem[0x130a4f378 -> 0x130988bd0]: initialize <Anonymous> mutate_cols mutate.data.frame mutate filter pull %>% eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch 
#> tracemem[0x106169288 -> 0x1054460e0]: new_data_frame vec_data dplyr_vec_data as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate filter pull %>% eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch 
#> tracemem[0x1054460e0 -> 0x105445f20]: new_data_frame dplyr_vec_data as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate filter pull %>% eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch 
#> tracemem[0x105445f20 -> 0x105445e40]: as.list.data.frame as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate filter pull %>% eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch
#> [1] 2
untracemem(df)

# without pipe

df2 <- data.frame(x = c(1, 2))

tracemem(df2)
#> [1] "<0x105941f60>"
df2 <- mutate(df2, x_sq = x^2)
#> tracemem[0x105941f60 -> 0x10716f648]: initialize <Anonymous> mutate_cols mutate.data.frame mutate eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch 
#> tracemem[0x10716f648 -> 0x10716f568]: initialize <Anonymous> mutate_cols mutate.data.frame mutate eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch 
#> tracemem[0x105941f60 -> 0x1072b6ae0]: new_data_frame vec_data dplyr_vec_data as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch 
#> tracemem[0x1072b6ae0 -> 0x1072b68b0]: new_data_frame dplyr_vec_data as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch 
#> tracemem[0x1072b68b0 -> 0x1072b67d0]: as.list.data.frame as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch
df2 <- filter(df2, x > 1)
pull(df2, x)
#> [1] 2
untracemem(df2)

Created on 2022-04-28 by the reprex package (v2.0.1)

For me, the benefit of pipe is mostly from a code readability perspective.

I find this:

data %>%
  f() %>%
  g() %>%
  h()

to be in line with my internal monologue

Take data and then (%>%) do f() and then do g() etc.

But, it's fine not to use it in the package development context where the team might not be familiar with it and where it might impose challenges for debugging.