Open msevestre opened 2 years ago
Why sometimes use %>% and sometimes assignment
There is no reason to prefer one way or another. They are identical, and what is preferred is just a matter of personal taste. That said, long piped chains can be difficult to debug.
what is ~ and .x in .fns = ~ tidyr::replace_na(.x, "missing") or purrr::map(yDataList, ~ .yUnitConverter(.x, yTargetUnit))
~ is lambda syntax to create anonymous functions. .x is a pronoun that refers to data. So, in the following:
purrr::keep(list_of_data_frames, ~ nrow(.x) > 0L)
the .x pronoun refers to each element of list_of_data_frames in turn, while ~ nrow(.x) > 0L creates an anonymous function.
If you don't want to use an anonymous function, you will have to create a named function:
checkEmptyDataFrame <- function(data) nrow(data) > 0L
purrr::keep(list_of_data_frames, checkEmptyDataFrame)
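To make the two variants above concrete, here is a minimal sketch, assuming purrr is installed. The contents of list_of_data_frames are made up for illustration; only the purrr::keep() calls come from the discussion.

```r
# Hypothetical input: one empty and two non-empty data frames
list_of_data_frames <- list(
  a = data.frame(x = 1:3),
  b = data.frame(x = integer(0)),
  c = data.frame(x = 4:5)
)

# Lambda (formula) syntax: .x stands for the list element currently being tested
kept_lambda <- purrr::keep(list_of_data_frames, ~ nrow(.x) > 0L)

# Same predicate written as a named function
checkEmptyDataFrame <- function(data) nrow(data) > 0L
kept_named <- purrr::keep(list_of_data_frames, checkEmptyDataFrame)

names(kept_lambda)                 # "a" "c"
identical(kept_lambda, kept_named) # TRUE
```

Both calls drop the empty data frame b; the only difference is how the predicate is spelled.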
Methods
I don't see the point in documenting them when they are already comprehensively documented on their respective package websites: https://www.tidyverse.org/packages/
Btw, tidyverse is not the only third-party package we are using. We are also using rClr.
If we are going to document dplyr:: methods, we should also be documenting all used instances of rClr:: methods because I don't understand what this code is doing, and its usage in the code is not always accompanied by comments.
This will also be helpful for future team members naive to rClr.
because I don't understand what this code is doing,
Are you serious or just being sarcastic?
But let me do it in any case:
rClr::clrSet
rClr::clrGet
rClr::clrCallStatic
rClr::clrCall
rClr::clrLoadAssembly
rClr::clrNew
I believe those are the 6 methods that we are calling. The names of the methods and the way they are being called seem pretty clear to me, but if not, let's clarify.
rClr is a wrapper so it just delegates to the underlying .NET classes. This is what it does. This is the only thing that it does. So with that in mind, let's check those methods:
rClr::clrLoadAssembly(filePathFor("OSPSuite.R.dll")) loads an assembly in memory.
rClr::clrGet calls the getter of a property of an object and returns the value.
rClr::clrGet(parameter, "Value") will call the getter of the property Value of the parameter and return the value.
rClr::clrSet calls the setter of a property of an object and assigns the given value.
rClr::clrSet(parameter, "Value", 5) will call the setter of the property Value of the parameter and set the value 5.
rClr::clrCall calls a method on an object. The extra arguments are the parameters required by the API, in order.
rClr::clrCall(sensitivityAnalysisTask, "ExportResultsToCSV", results, results$simulation, filePath) will call the method ExportResultsToCSV of the sensitivityAnalysisTask object and pass it 3 parameters (results, simulation and filePath).
rClr::clrNew instantiates a new object of a class.
rClr::clrNew("PKSim.Core.Snapshots.Parameter") will create a new instance of type PKSim.Core.Snapshots.Parameter.
I do not think we need to document those lines in the code as they read exactly as what they do. No magic. Just plain code
Hopefully this clarifies what this wrapper is doing
I don't see the point in documenting them
I disagree. That's why I think we should make our own simplified version. Let's look at an example, shall we?
https://dplyr.tidyverse.org/reference/arrange.html
Very helpful. Then I need to scroll for a while to see an example
Seems easy enough. ok
A bit further
Already no idea
This is the problem with the doc: it documents the how but not the WHY. It does not say what the code is doing.
Arrange is probably the easiest of those methods BTW. And we can agree to disagree. I am not asking you to do it. I will do it myself when I have time
Are you serious or just being sarcastic?
I wasn't being sarcastic. I truly found it difficult, and it's quite handy to have the reference you have posted here.
I will get rid of purrr's lambda syntax.
No no. We just need to know what it means
In fact, can we use this syntax instead of function(x) {x + 2}? Or is this only when used as an argument of other functions?
We (all but you) need to get better at using this thing
We (all but you) need to get better at using this thing
Maybe, but in esqLABS R-devel meeting I was informed to only use base-R as much as possible because most people don't know tidyverse and reviewing and maintaining this code can be a challenge in the future. So this might be a good opportunity for me to reduce tidyverse usage as much as possible. I can do it one PR at a time.
Let's talk about this with @PavelBal next time. As far as I am concerned, the decisions made for esqLABS have no bearing on what we do here. I personally think we should make sure our code is easy to understand, and I don't necessarily believe that base R always achieves this.
Okay, I have two PRs now that demonstrate my internal conflict on what is the expected modus operandi for me:
We can discuss which one to merge, and that sets a precedent for a lot to come.
My opinion:
What is the benefit of using ~ over \(x)?
Always use {} also in anonymous functions, so it is clear where the function starts and where it ends.
Here I am already confused: why do we need the lambda function in purrr::keep(list_of_data_frames, ~ nrow(.x) > 0L) (well, actually it does make sense to me), but not for dplyr::filter(data, x > 5)?
Gross simplification: purrr is used for working with lists; dplyr with data frames.
purrr::keep() is working with a list, and a lambda function is needed to apply it to each element of the list. Similar to:
.removeEmptyDataFrame <- function(x) Filter(function(data) nrow(data) > 0L, x)
You don't need to use lambda syntax, you can just use an anonymous function, but lambda is more idiomatic of a functional programming tool like purrr:
purrr::keep(list_of_data_frames, function(data) nrow(data) > 0L)
dplyr::filter() is using the x > 5 condition, and not a function, to filter out parts of a data frame:
dplyr::filter(data, x > 5)
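The list-versus-data-frame distinction can be tried out side by side. This is only a sketch with made-up inputs, assuming purrr and dplyr are installed:

```r
# Hypothetical inputs
list_of_data_frames <- list(data.frame(x = 1:3), data.frame(x = integer(0)))
data <- data.frame(x = c(2, 6, 9))

# purrr::keep(): the predicate is a function, called once per list element
non_empty <- purrr::keep(list_of_data_frames, function(data) nrow(data) > 0L)

# Base R counterpart, as used in .removeEmptyDataFrame
non_empty_base <- Filter(function(data) nrow(data) > 0L, list_of_data_frames)

# dplyr::filter(): the condition is an expression evaluated on the columns
# of a single data frame, so no function is needed
big_x <- dplyr::filter(data, x > 5)

length(non_empty) # 1
big_x$x           # 6 9
```

In the first two calls the thing being tested is a whole data frame (a list element); in the last call it is each row of one data frame.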
what is the benefit of using ~ over \(x)?
The \(x) syntax for anonymous functions was introduced only a year ago in R 4.1. So if you want to use it, you will have to bump the minimum R version for this package to 4.1. The lambda syntax works with all versions of R > 3.4 (at the minimum).
Plus, R users are intimately familiar with ~ via statistical models (e.g. lm(wt ~ mpg, mtcars)).
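For comparison, here are the three spellings next to each other. This is a sketch, assuming purrr is installed and R >= 4.1 for the last definition (on older versions the \(x) line is a parse error). The add_two_* names are made up. It also shows that a bare ~ expression is just a formula in base R; it only becomes a function where purrr-style functions convert it.

```r
# On its own, ~ creates a formula, not a function
f <- ~ .x + 2
is.function(f)         # FALSE
inherits(f, "formula") # TRUE

# purrr converts the formula into a function when it needs one;
# purrr::as_mapper() exposes that conversion directly
add_two_lambda <- purrr::as_mapper(~ .x + 2)
add_two_lambda(1) # 3

# Classic anonymous function syntax, works in every R version
add_two_classic <- function(x) x + 2
add_two_classic(1) # 3

# Shorthand syntax, R >= 4.1 only
add_two_short <- \(x) x + 2
add_two_short(1) # 3
```

So the ~ form cannot replace function(x) {x + 2} everywhere; it is only understood by functions that accept purrr-style lambdas.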
If you are looking for dplyr to base translation: https://dplyr.tidyverse.org/articles/base.html
always use {} also in anonymous functions, so it is clear where the function starts and where it ends
I disagree.
But, if this is the standard we want to adopt in OSP R code, then it should be part of our R coding standards.
Both styler and the tidyverse style guide find the non-braced syntax for anonymous functions to be acceptable under their guidelines:
styler::style_text("(function(x, y) z <- x^2 + y^2)(0:7, 1)")
#> (function(x, y) z <- x^2 + y^2)(0:7, 1)
Created on 2022-04-25 by the reprex package (v2.0.1.9000)
In fact, it'd be highly unusual to use {} for anonymous functions in tidyverse workflows if there is only a single statement. You can see this on display in every use of anonymous functions on their websites. E.g.
Working with df, dplyr seems to be a clear winner.
lm(wt ~ mpg, mtcars)
Even with all my ~ knowledge (as of Friday :)), I am not quite sure what this does. This is a lambda but .x is not used?
Even with all my ~ knowledge (as of Friday :)) , I am not quite sure what this does. This is a lambda but .x is not used?
No, no. I meant to say that the lambda syntax was implemented using ~ because R users were already familiar with it via statistical modeling.
lm(wt ~ mpg, mtcars) is read as "linearly regress wt variable on mpg variable".
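A quick way to see that the ~ in lm() is a model formula and not a lambda is to fit the model and inspect it. This sketch uses only base R and the built-in mtcars data set:

```r
# wt is the response, mpg the predictor
fit <- lm(wt ~ mpg, data = mtcars)

# The formula is stored on the fitted model; there is no .x involved
class(formula(fit)) # "formula"

# One intercept and one slope for mpg
names(coef(fit))    # "(Intercept)" "mpg"
```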
Systematic comparison of tidy and base-R equivalent functions.
At least for a reasonably sized data frame (here, one million rows), the tidy solutions are roughly twice as fast (median of about 27-28 ms versus 55.5 ms) and allocate less memory (42-44 MB versus 69.1 MB). Using or not using the pipe doesn't make much of a difference.
library(dplyr, warn.conflicts = FALSE)
set.seed(123)
x <- rnorm(1e6)
df <- data.frame(x = x)
# dplyr with pipe
df_tidy_pipe <- function(data) {
  data %>%
    mutate(x_sq = x^2) %>%
    filter(x > 0) %>%
    arrange(x_sq) %>%
    pull(x_sq)
}
# dplyr without pipe
df_tidy_nopipe <- function(data) {
  data <- mutate(data, x_sq = x^2)
  data <- filter(data, x > 0)
  data <- arrange(data, x_sq)
  data <- pull(data, x_sq)
  return(data)
}
# base-R
df_base <- function(data) {
  data <- transform(data, x_sq = x^2)
  data <- subset(data, x > 0)
  data <- data[order(data$x_sq), , drop = FALSE]
  data <- data[["x_sq"]]
  return(data)
}
bench::mark(
  "base" = df_base(df),
  "tidy - pipe" = df_tidy_pipe(df),
  "tidy - no pipe" = df_tidy_nopipe(df),
  check = TRUE
)[1:8]
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 3 × 6
#>   expression          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base             46.6ms   55.5ms      14.8    69.1MB     38.7
#> 2 tidy - pipe        24ms   28.1ms      29.9    44.4MB     43.0
#> 3 tidy - no pipe   24.7ms   26.6ms      33.1      42MB     42.8
Created on 2022-04-28 by the reprex package (v2.0.1)
@msevestre To answer your question in the PR:
Whether a pipe is used or not will not change how many copies are created. So, if I ever said that, I was clearly in the wrong.
library(dplyr, warn.conflicts = FALSE)
# with pipe
df <- data.frame(x = c(1, 2))
tracemem(df)
#> [1] "<0x106169288>"
df %>%
  mutate(x_sq = x^2) %>%
  filter(x > 1) %>%
  pull(x)
#> tracemem[0x106169288 -> 0x130a4f378]: initialize <Anonymous> mutate_cols mutate.data.frame mutate filter pull %>% eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch
#> tracemem[0x130a4f378 -> 0x130988bd0]: initialize <Anonymous> mutate_cols mutate.data.frame mutate filter pull %>% eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch
#> tracemem[0x106169288 -> 0x1054460e0]: new_data_frame vec_data dplyr_vec_data as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate filter pull %>% eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch
#> tracemem[0x1054460e0 -> 0x105445f20]: new_data_frame dplyr_vec_data as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate filter pull %>% eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch
#> tracemem[0x105445f20 -> 0x105445e40]: as.list.data.frame as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate filter pull %>% eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch
#> [1] 2
untracemem(df)
# without pipe
df2 <- data.frame(x = c(1, 2))
tracemem(df2)
#> [1] "<0x105941f60>"
df2 <- mutate(df2, x_sq = x^2)
#> tracemem[0x105941f60 -> 0x10716f648]: initialize <Anonymous> mutate_cols mutate.data.frame mutate eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch
#> tracemem[0x10716f648 -> 0x10716f568]: initialize <Anonymous> mutate_cols mutate.data.frame mutate eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch
#> tracemem[0x105941f60 -> 0x1072b6ae0]: new_data_frame vec_data dplyr_vec_data as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch
#> tracemem[0x1072b6ae0 -> 0x1072b68b0]: new_data_frame dplyr_vec_data as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch
#> tracemem[0x1072b68b0 -> 0x1072b67d0]: as.list.data.frame as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch
df2 <- filter(df2, x > 1)
pull(df2, x)
#> [1] 2
untracemem(df2)
Created on 2022-04-28 by the reprex package (v2.0.1)
For me, the benefit of pipe is mostly from a code readability perspective.
I find this:
data %>%
  f() %>%
  g() %>%
  h()
to be in line with my internal monologue:
Take data, and then (%>%) do f(), and then do g(), etc.
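The pipe is only a different way of writing nested or step-by-step calls. Here is a minimal sketch, assuming magrittr is installed (it provides the %>% that dplyr re-exports). The functions f, g and h and the input are placeholders invented for illustration:

```r
library(magrittr)

# Placeholder functions standing in for f(), g() and h()
f <- function(x) x + 1
g <- function(x) x * 2
h <- function(x) sqrt(x)
data <- c(1, 7)

# Piped: take data, and then f(), and then g(), and then h()
piped <- data %>% f() %>% g() %>% h()

# Nested: reads inside-out
nested <- h(g(f(data)))

# Step by step with assignment: easier to inspect intermediate values
step <- f(data)
step <- g(step)
step <- h(step)

piped                    # 2 4
identical(piped, nested) # TRUE
identical(piped, step)   # TRUE
```

All three forms compute the same result; the choice is about readability and how easy it is to stop at an intermediate value while debugging.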
But, it's fine not to use it in the package development context where the team might not be familiar with it and where it might impose challenges for debugging.
We have introduced some performant code using the dplyr, purrr, and tidyr packages. But usage of said packages adds complexity to the code, specifically for team members like me who do not know how to use those packages.
we need to document (and it's ok to refer to the documentation if it is clear) the following methods and idiomatic R syntax (To be extended)
Methods
dplyr::mutate
dplyr::across
dplyr::matches
dplyr::bind_rows
purrr::map (used in purrr::map(yDataList, ~ .yUnitConverter(.x, yTargetUnit)))
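As a starting point for that documentation, the methods listed above can be shown together in one small sketch. It assumes dplyr, tidyr and purrr are installed; the data frames and column names are made up, and only the .fns = ~ tidyr::replace_na(.x, "missing") idiom is taken from our code.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical data
df1 <- data.frame(
  name_a = c("x", NA), name_b = c(NA, "y"), value = c(1, 2),
  stringsAsFactors = FALSE
)
df2 <- data.frame(name_a = "z", value = 3, stringsAsFactors = FALSE)

# bind_rows(): stacks data frames; columns missing in one input are filled with NA
combined <- bind_rows(df1, df2)

# mutate(): creates or modifies columns
# across(): applies .fns to every selected column
# matches(): selects columns whose name matches a regular expression
cleaned <- mutate(
  combined,
  across(matches("^name"), .fns = ~ tidyr::replace_na(.x, "missing"))
)
cleaned$name_b # "missing" "y" "missing"

# purrr::map(): calls the function once per list element and returns a list
doubled <- purrr::map(list(a = 1, b = 2), ~ .x * 2)
doubled$b # 4
```

Inside across(), .x is the column currently being processed; inside purrr::map(), .x is the list element currently being processed.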
Syntax
what is ~ and .x in .fns = ~ tidyr::replace_na(.x, "missing") or purrr::map(yDataList, ~ .yUnitConverter(.x, yTargetUnit))
Why sometimes use %>% and sometimes assignment, e.g.
vs (might not be syntactically correct)