topepo opened 6 years ago
On the topic of ellipses (`...`): I spent some time digging through the code base connecting high-level and low-level APIs, and I feel as if there must be some minutia or cultural convention of R that I'm missing.
In the context of modeling, what is the best practice for passing a varying number of arguments to a function?
Specifically, for these two cases, the order will affect how you handle "unpacking arguments" within a high-level API:

- a varying number of non-keyworded arguments to a function (positional, e.g. `foo(8)`)
- a varying number of keyworded arguments to a function (named, e.g. `foo(Args = 8)`)
Is it a best practice to use `vars_select` to unpack arguments from `...`?
Is `match.arg` sufficient for matching keyworded string arguments? Or is partial matching too susceptible to user input errors? For example, for some modeling method, the function `foo` would be the top-level API that users experience, and some other function (say `compute_foo_fit`) would be used to do the computations. This allows different interfaces to be used to specify the model, each passing common data structures to `compute_foo_fit`.
And `foo` should really be a wrapper:
```r
foo <- function(spec_options, fit_options) {
  foo_spec <- create_foo_model_specification(spec_options) # need a naming convention here
  fit(foo_spec, fit_options)
}
```
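To make the decoupling concrete, here is a runnable toy sketch of the pattern under discussion. All names (`foo`, `compute_foo_fit`) and the trivial least-squares "model" are illustrative assumptions, not from any real package:

```r
# Low-level computational function: takes plain, computation-ready inputs.
compute_foo_fit <- function(x, y, intercept = TRUE) {
  if (intercept) x <- cbind(`(Intercept)` = 1, x)
  # A trivial "model": least-squares coefficients via QR decomposition.
  coefs <- qr.solve(x, y)
  structure(list(coefficients = coefs), class = "foo_fit")
}

# High-level interface: accepts a formula + data frame and converts them to
# the matrix/vector encoding that the computational code expects.
foo <- function(formula, data, ...) {
  mf <- model.frame(formula, data)
  x  <- model.matrix(attr(mf, "terms"), mf)[, -1, drop = FALSE]
  y  <- model.response(mf)
  compute_foo_fit(x, y, ...)
}

res <- foo(mpg ~ wt + hp, data = mtcars)
res$coefficients
```

A second interface (say, an `x`/`y` matrix interface, or a recipes-based one) could be added later without touching `compute_foo_fit` at all.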
> In the context of modeling, what is the best practice for passing a varying number of arguments to a function?

> Is it a best practice to use `vars_select` to unpack arguments from `...`?
With everyday functions, `...` is the best method. With the tidyverse, we usually use `...` for selectors, so there are a few different approaches that we are looking at:

- use `quo(...)` and pick off the named arguments as the extra params.
- require that the extra arguments be placed in their own list (`foo(a, b, args = list())`).
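A minimal base-R sketch of the two approaches (the tidyverse version would capture with `rlang::quos(...)` to delay evaluation; `list(...)` evaluates eagerly but shows the idea, and all function names here are hypothetical):

```r
# Approach 1: capture `...` and pick off the named elements.
foo_dots <- function(x, ...) {
  dots <- list(...)
  nm <- names(dots)
  if (is.null(nm)) nm <- rep("", length(dots))
  named   <- dots[nm != ""]  # keyworded extras, e.g. foo_dots(1, tol = 1e-6)
  unnamed <- dots[nm == ""]  # positional extras, e.g. foo_dots(1, 8)
  list(named = named, unnamed = unnamed)
}

# Approach 2: require the extras in their own list, so nothing in `...`
# is ambiguous and the selector use of `...` stays free.
foo_list <- function(x, args = list()) {
  stopifnot(is.list(args))
  args
}

foo_dots(1, 8, tol = 1e-6)
foo_list(1, args = list(tol = 1e-6))
```

Approach 2 is more verbose for the user but removes any chance of an extra argument colliding with a selector or a main argument via partial matching.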
@fanny
Also, we haven't formally templated out what a tidy model interface looks like (and how it should be implemented). Once we formalize that, it will be much easier to answer questions like the one that you posed.
It's a good question though (and clearly there isn't a definitive answer yet).
This is a start of a draft based on things that have been on my mind for the last few days. It will need to be expanded once we have tidy interface notes/recommendations. It should also be reorganized to be more coherent. This is the model fit analog to topepo/parsnip#41.
<snip>
We distinguish between "top-level"/"user-facing" APIs and "low-level"/"computational" APIs. The former is the interface between the users of the function (with their needs) and the code that does the estimation/training activities.
When creating model objects, the conventions are:

- Function names should use snake_case instead of camelCase.
- The computational code that fits the model should be decoupled from the interface code that specifies the model. For example, for some modeling method, the function `foo` would be the top-level API that users experience, and some other function (say `compute_foo_fit`) would be used to do the computations. This allows different interfaces to be used to specify the model, each passing common data structures to `compute_foo_fit`.
- Only user-appropriate data structures should be accommodated by the user-facing function. The underlying computational code should make the appropriate transformations to computationally appropriate formats/encodings (for example, the `survival::Surv` convention).
- Design the top-level code for humans. This includes using sensible defaults and protecting against common errors. Design your top-level interface code so that people will not hate you. For example:
  - Suppose a model can only fit numeric or two-class outcomes and uses maximum likelihood. Instead of providing the user with a `distribution` option that is either "Gaussian" or "Binomial", determine this from the type of the data object (numeric or factor) and set it internally. This prevents the user from making a mistake that could have been avoided.
  - If a model parameter is bounded by some aspect of the data, such as the number of rows or columns, coerce bad values to this range (e.g. `mtry = min(mtry, ncol(x))`), with an accompanying warning when this is critical information.
- Parameters that users will commonly modify should be main arguments to the top-level function. Others, especially those that control computational aspects of the fit, should be contained in a `control` object.
- If the model fit code must produce output, a verbose option should be provided that defaults to no printed output. (`message`? We should get a good r-lib recommendation.)
- A test set should never be required when fitting a model.
- If internal resampling is involved in the fitting process, there is a strong preference for using `tidymodels` infrastructure so that a common interface (and set of choices) can be used. If this cannot be done (e.g. the resampling occurs in C code), there should be some ability to pass in integer values that define the resamples. In this way, the internal sampling is reproducible.
- When possible, do not reimplement computations that have been done well elsewhere (tidy or not). For example, kernel methods should use the infrastructure in `kernlab`, exponential family distribution computations should use those in `?family`, etc.
- For modeling packages that use random numbers, setting the seed in R should control how random numbers are generated internally. At worst, a random number seed for non-R code (e.g. C, Java) should be an argument to the main modeling function.
- If your model passes `...` to another modeling function, consider the names of your function's arguments to avoid conflicts with the argument names of the underlying function.
- Computational code should (almost) always use `X[, , drop = FALSE]` to make sure that matrices stay matrices.
- When parallelism is used in the computational code:
  - Provide an argument to specify the amount (e.g. number of cores, if appropriate) and default the function to run sequentially.
  - Computations should be easily reproducible, even when run in parallel. Parallelism should not be an excuse for irreproducibility.
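A sketch of both parallelism points, using only the base `parallel` package (the function and its bootstrap-style resampling are hypothetical): sequential by default, with `clusterSetRNGStream()` making the parallel random numbers reproducible.

```r
library(parallel)

# Hypothetical helper: runs sequentially unless cores > 1; the parallel
# path seeds the workers' RNG streams so repeated runs match exactly.
fit_resamples <- function(x, times = 4, cores = 1, seed = 42) {
  one_fit <- function(i, x) mean(sample(x, replace = TRUE))
  if (cores > 1) {
    cl <- makeCluster(cores)
    on.exit(stopCluster(cl))
    clusterSetRNGStream(cl, iseed = seed)  # reproducible parallel RNG
    unlist(parLapply(cl, seq_len(times), one_fit, x = x))
  } else {
    set.seed(seed)
    unlist(lapply(seq_len(times), one_fit, x = x))
  }
}

x <- rnorm(50)
res1 <- fit_resamples(x, cores = 2)
res2 <- fit_resamples(x, cores = 2)
identical(res1, res2)  # parallel runs give the same resamples
```

Note that the sequential and parallel paths use different RNG streams, so their results will not match each other; the guarantee is that each mode is reproducible run-to-run.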