topepo opened 6 years ago
On the topic of ellipses (`...`): I spent some time digging through the code base connecting high-level and low-level APIs, and I feel as if there must be some minutia or cultural convention of R that I'm missing.
In the context of modeling, what is the best practice for passing a varying number of arguments to a function?
Specifically, for these two cases, the order will affect how you handle "unpacking arguments" within a high-level API:

- a varying number of non-keyworded arguments to a function (positional, e.g. `foo(8)`)
- a varying number of keyworded arguments to a function (named, e.g. `foo(Args = 8)`)
Is it a best practice to use `vars_select` to unpack arguments from `...`?
Is `match.arg` sufficient for matching keyworded string arguments? Or is partial matching too susceptible to user input errors? For example, for some modeling method, the function `foo` would be the top-level API that users experience, and some other function (say `compute_foo_fit`) would be used to do the computations. This allows different interfaces to be used to specify the model, each passing common data structures to `compute_foo_fit`.
And `foo` should really be a wrapper:
```r
foo <- function(spec_options, fit_options) {
  foo_spec <- create_foo_model_specification(spec_options) # need a naming convention here
  fit(foo_spec, fit_options)
}
```
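To make the decoupling concrete, here is a runnable toy sketch of the pattern under discussion. All names (`foo`, `compute_foo_fit`) and the trivial least-squares "model" are illustrative assumptions, not from any real package:

```r
# Low-level computational function: takes plain, computation-ready inputs.
compute_foo_fit <- function(x, y, intercept = TRUE) {
  if (intercept) x <- cbind(`(Intercept)` = 1, x)
  # A trivial "model": least-squares coefficients via QR decomposition.
  coefs <- qr.solve(x, y)
  structure(list(coefficients = coefs), class = "foo_fit")
}

# High-level interface: accepts a formula + data frame and converts them to
# the matrix/vector encoding that the computational code expects.
foo <- function(formula, data, ...) {
  mf <- model.frame(formula, data)
  x  <- model.matrix(attr(mf, "terms"), mf)[, -1, drop = FALSE]
  y  <- model.response(mf)
  compute_foo_fit(x, y, ...)
}

res <- foo(mpg ~ wt + hp, data = mtcars)
res$coefficients
```

A second interface (say, an `x`/`y` matrix interface, or a recipes-based one) could be added later without touching `compute_foo_fit` at all.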
> In the context of modeling, what is the best practice for passing a varying number of arguments to a function?

> Is it a best practice to use `vars_select` to unpack arguments from `...`?
With everyday functions, `...` is the best method. With the tidyverse, we usually use `...` for selectors, so there are a few different approaches that we are looking at:

- use `quo(...)` and pick off the named arguments as the extra params.
- require that the extra arguments be placed in their own list (`foo(a, b, args = list())`).
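A minimal base-R sketch of the two approaches (the tidyverse version would capture with `rlang::quos(...)` to delay evaluation; `list(...)` evaluates eagerly but shows the idea, and all function names here are hypothetical):

```r
# Approach 1: capture `...` and pick off the named elements.
foo_dots <- function(x, ...) {
  dots <- list(...)
  nm <- names(dots)
  if (is.null(nm)) nm <- rep("", length(dots))
  named   <- dots[nm != ""]  # keyworded extras, e.g. foo_dots(1, tol = 1e-6)
  unnamed <- dots[nm == ""]  # positional extras, e.g. foo_dots(1, 8)
  list(named = named, unnamed = unnamed)
}

# Approach 2: require the extras in their own list, so nothing in `...`
# is ambiguous and the selector use of `...` stays free.
foo_list <- function(x, args = list()) {
  stopifnot(is.list(args))
  args
}

foo_dots(1, 8, tol = 1e-6)
foo_list(1, args = list(tol = 1e-6))
```

Approach 2 is more verbose for the user but removes any chance of an extra argument colliding with a selector or a main argument via partial matching.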
@fanny
Also, we haven't formally templated out what a tidy model interface looks like (and how it should be implemented). Once we formalize that, it will be much easier to answer questions like the one that you posed.
It's a good question though (and clearly there isn't a definitive answer yet).
This is a start of a draft based on things that have been on my mind for the last few days. It will need to be expanded once we have tidy interface notes/recommendations. It should also be reorganized to be more coherent. This is the model fit analog to topepo/parsnip#41.
<snip>
We distinguish between "top-level"/"user-facing" APIs and "low-level"/"computational" APIs. The former is the interface between the users of the function (with their needs) and the code that does the estimation/training activities.
When creating model objects, the conventions are:

- Function names should use snake_case instead of camelCase.
- The computational code that fits the model should be decoupled from the interface code that specifies the model. For example, for some modeling method, the function `foo` would be the top-level API that users experience, and some other function (say `compute_foo_fit`) would be used to do the computations. This allows different interfaces to be used to specify the model, each passing common data structures to `compute_foo_fit`.
- Only user-appropriate data structures should be accommodated by the user-facing function. The underlying computational code should make the appropriate transformations to computationally appropriate formats/encodings (for example, the `survival::Surv` convention).
- Design the top-level code for humans. This includes using sensible defaults and protecting against common errors. Design your top-level interface code so that people will not hate you. For example:
  - Suppose a model can only fit numeric or two-class outcomes and uses maximum likelihood. Instead of providing the user with a `distribution` option that is either "Gaussian" or "Binomial", determine this from the type of the data object (numeric or factor) and set it internally. This prevents the user from making a mistake that could have been avoided.
  - If a model parameter is bounded by some aspect of the data, such as the number of rows or columns, coerce bad values to this range (e.g. `mtry = min(mtry, ncol(x))`), with an accompanying warning when this is critical information.
- Parameters that users will commonly modify should be main arguments to the top-level function. Others, especially those that control computational aspects of the fit, should be contained in a `control` object.
- If the model fit code must produce output, a verbose option should be provided that defaults to no printed output. (`message`? We should get a good r-lib recommendation.)
- A test set should never be required when fitting a model.
- If internal resampling is involved in the fitting process, there is a strong preference for using `tidymodels` infrastructure so that a common interface (and set of choices) can be used. If this cannot be done (e.g. the resampling occurs in C code), there should be some ability to pass in integer values that define the resamples. In this way, the internal sampling is reproducible.
- When possible, do not reimplement computations that have been done well elsewhere (tidy or not). For example, kernel methods should use the infrastructure in `kernlab`, exponential family distribution computations should use those in `?family`, etc.
- For modeling packages that use random numbers, setting the seed in R should control how random numbers are generated internally. At worst, a random number seed for non-R code (e.g. C, Java) should be an argument to the main modeling function.
- If your model passes `...` to another modeling function, consider the names of your function's arguments to avoid conflicts with the argument names of the underlying function.
- Computational code should (almost) always use `X[, , drop = FALSE]` to make sure that matrices stay matrices.
- When parallelism is used in the computational code:
  - Provide an argument to specify the amount (e.g. number of cores, if appropriate) and default the function to run sequentially.
  - Computations should be easily reproducible, even when run in parallel. Parallelism should not be an excuse for irreproducibility.
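A sketch of both parallelism points, using only the base `parallel` package (the function and its bootstrap-style resampling are hypothetical): sequential by default, with `clusterSetRNGStream()` making the parallel random numbers reproducible.

```r
library(parallel)

# Hypothetical helper: runs sequentially unless cores > 1; the parallel
# path seeds the workers' RNG streams so repeated runs match exactly.
fit_resamples <- function(x, times = 4, cores = 1, seed = 42) {
  one_fit <- function(i, x) mean(sample(x, replace = TRUE))
  if (cores > 1) {
    cl <- makeCluster(cores)
    on.exit(stopCluster(cl))
    clusterSetRNGStream(cl, iseed = seed)  # reproducible parallel RNG
    unlist(parLapply(cl, seq_len(times), one_fit, x = x))
  } else {
    set.seed(seed)
    unlist(lapply(seq_len(times), one_fit, x = x))
  }
}

x <- rnorm(50)
res1 <- fit_resamples(x, cores = 2)
res2 <- fit_resamples(x, cores = 2)
identical(res1, res2)  # parallel runs give the same resamples
```

Note that the sequential and parallel paths use different RNG streams, so their results will not match each other; the guarantee is that each mode is reproducible run-to-run.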