Good to have a native speaker on board! :-)
Happy to be here, to say the least.
Man, I haven't had time to re-read the final pdf. This timezone thing is tough; we should define some easybreaks with no critical activity for lunches, dinners, parties and whatnot.
But it's okay, I am sure the revisions are perfect.
Ha! I like that: easybreaks. But I hear you - the time is tricky. But rest assured we incorporated your suggestions. It's looking good!
> lunches, dinners, parties and whatnot.
Do this once you're retired! Believe me, I studied gerontology ;-)
@DominiqueMakowski @pdwaggoner
We got the response from Alex, so let's see what we can address in which way.
> find_weights() -> weights_name(), get_weights() -> weights_values()
This is doable, the question is, do we want that? According to our code-style convention, we should stay with get/find. We could, of course, add aliases for all functions if this makes life easier for users.
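For illustration, such an alias would be a one-liner per function (a sketch, assuming we keep the find/get names as canonical; the alias names below are just the reviewer's suggestions):

```r
# Hypothetical aliases layered on top of the existing functions:
weights_name <- find_weights
weights_values <- get_weights
```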
> Note that you have chosen to define terms and variables in a non-standard way for the R community. In an R model formula an entire chunk such as I(x^2) is referred to as a term, rather than a variable. I recommend following this definition, which amounts to flipping the current usage of term and variable.
I'm not sure about the definition, and couldn't yet find proper "definitions". If we really break conventions here, it might be better to flip the names, as suggested. That would mean that we need to use an intermediate version check for insight in our packages where we use one of these functions, else we will have breaking code.
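For illustration, such an intermediate version check in a downstream package might look like this (a sketch; "0.9.0" is a placeholder, not an actual insight release number):

```r
# Hypothetical guard while the renaming propagates; "0.9.0" stands in
# for whichever insight release swaps the two functions.
if (utils::packageVersion("insight") >= "0.9.0") {
  model_terms <- insight::find_terms(model)
} else {
  model_terms <- insight::find_variables(model)
}
```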
> Finally, I recommend grouping get*() and find*() together on the pkgdown site in the Reference section. That is, you want get_parameters() and find_parameters() (or parameter_names() and parameter_values()) to appear next to each other.
Not sure how to do this in pkgdown, except when we "merge" all related find/get-functions together in one doc. But that would mess up the help-files, so I would suggest finding another solution.
> It makes me nervous that the example at https://easystats.github.io/insight/reference/clean_parameters.html warns about overwriting quosure methods.
This is an R 3.6 thing, and related to other packages. Easy to address, I think (by just responding/explaining)
> I appreciate that you've linked to packages that use insight. However, this is not especially useful to users -- instead of demonstrating how to solve a specific problem with insight, the user must now go read an entire package on their own, find a usage of insight, and infer the problem that is being solved. I would construct explicit, self-contained examples showing idiomatic usage. If these are inspired by the linked packages, fantastic, but merely linking to another code base isn't that helpful.
Not sure. Maybe it's easy to address, if we just add one or two code-examples to the paper? If so, that is rather easy to address.
> Instead of providing a new generic n_obs() I would provide methods for the existing generic nobs().
On the one hand, yes, we could do that. On the other hand, once other packages add their own nobs() method, users will always see the warning that insight overwrites S3-methods. But this is probably not often the case. I would suggest keeping n_obs(), and adding nobs() where needed (the definition would then be nobs.xyz <- n_obs.xyz), what do you think?
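To make that concrete, a minimal sketch of such a delegating method (the class "xyz" is just a placeholder for a model class that lacks an existing nobs() method):

```r
# Hypothetical S3 method that simply delegates to insight:
nobs.xyz <- function(object, ...) {
  insight::n_obs(object)
}
```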
1) > find_weights() -> weights_name(), get_weights() -> weights_values()
This is doable, the question is, do we want that? According to our code-style convention, we should stay with get/find. We could, of course, add aliases for all functions if this makes life easier for users.
I agree with you.
2) > Note that you have chosen to define terms and variables in a non-standard way for the R community. In an R model formula an entire chunk such as I(x^2) is referred to as a term, rather than a variable. I recommend following this definition, which amounts to flipping the current usage of term and variable.
I'm not sure about the definition, and couldn't yet find proper "definitions". If we really break conventions here, it might be better to flip the names, as suggested. That would mean that we need to use an intermediate version check for insight in our packages where we use one of these functions, else we will have breaking code.
> model <- lm(Sepal.Length ~ I(Sepal.Width^2), data = iris)
> insight::find_variables(model)
$response
[1] "Sepal.Length"
$conditional
[1] "I(Sepal.Width^2)"
> insight::find_terms(model)
$response
[1] "Sepal.Length"
$conditional
[1] "Sepal.Width"
I am okay with swapping the two. Better to make the breaking change now than in the future.
> Finally, I recommend grouping get*() and find*() together on the pkgdown site in the Reference section. That is, you want get_parameters() and find_parameters() (or parameter_names() and parameter_values()) to appear next to each other.
>
> Not sure how to do this in pkgdown, except when we "merge" all related find/get-functions together in one doc. But that would mess up the help-files, so I would suggest finding another solution.
AFAIK the only way to do that is indeed to have the same docstring for both functions using @rdname. I think it's doable in most cases to merge the docstrings.
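As a sketch of the @rdname approach (function bodies elided to bare generics; this just shows how two functions end up on one help page and thus next to each other in the reference):

```r
#' Find names of model parameters
#' @rdname find_parameters
#' @export
find_parameters <- function(x, ...) UseMethod("find_parameters")

#' @rdname find_parameters
#' @export
get_parameters <- function(x, ...) UseMethod("get_parameters")
```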
> It makes me nervous that the example at https://easystats.github.io/insight/reference/clean_parameters.html warns about overwriting quosure methods.
>
> This is an R 3.6 thing, and related to other packages. Easy to address, I think (by just responding/explaining).
Agreed.
> I appreciate that you've linked to packages that use insight. However, this is not especially useful to users -- instead of demonstrating how to solve a specific problem with insight, the user must now go read an entire package on their own, find a usage of insight, and infer the problem that is being solved. I would construct explicit, self-contained examples showing idiomatic usage. If these are inspired by the linked packages, fantastic, but merely linking to another code base isn't that helpful.
>
> Not sure. Maybe it's easy to address, if we just add one or two code-examples to the paper? If so, that is rather easy to address.
I think what is lacking here is a small "Example" section on the README. We could demonstrate some of the find_* functions (as the get_* ones have a long output not really appropriate for a README) for a model such as the one above (with the iris dataset)? I can give it a try.
> Instead of providing a new generic n_obs() I would provide methods for the existing generic nobs().
>
> On the one hand, yes, we could do that. On the other hand, once other packages add their own nobs() method, users will always see the warning that insight overwrites S3-methods. But this is probably not often the case. I would suggest keeping n_obs(), and adding nobs() where needed (the definition would then be nobs.xyz <- n_obs.xyz), what do you think?
I agree, we should not redefine nobs as it is indeed used by quite a few packages, but rather add the method where necessary.
These are all good thoughts from Alex as well as from you both on moving forward. My broader question concerns the scope and location of the changes requested: it's unclear whether Alex is requesting changes in the paper or in the repo/software. The answer would influence where and how we proceed in making corrections. After reading my responses below, and in addition to this question, let me know how you'd like to proceed with corrections to get this thing published! We are close. Any thoughts there?
Finally, here are a few thoughts on the points where it makes most sense for me to respond:
> This is doable, the question is, do we want that? According to our code-style convention, we should stay with get/find. We could, of course, add aliases for all functions if this makes life easier for users.
I also agree that we should stick to (and thus defend, appropriately) our code-style convention.
> I'm not sure about the definition, and couldn't yet find proper "definitions". If we really break conventions here, it might be better to flip the names, as suggested. That would mean that we need to use an intermediate version check for insight in our packages where we use one of these functions, else we will have breaking code.
I also agree that making a switch consistent with broader R conventions is probably a good idea, though, I too was unaware of this broader community definition. Not sure if this is worth fighting for or not...
> Not sure. Maybe it's easy to address, if we just add one or two code-examples to the paper? If so, that is rather easy to address.
I agree with @DominiqueMakowski here - I think we have demonstrated usability for a "problem" a user may encounter in the paper already (though see my comments at the outset here, on whether Alex wants this change in the documentation or in the paper). If it's the paper, I think we gently point him to the fact that we have addressed this already, but that we could add a sentence clarifying that we have demonstrated a "real-world" problem and how insight helps with that. But if he wants the change in the documentation, then I like @DominiqueMakowski's suggestion here of adding a couple more "applied" examples to the README. Let me know if you need my help here at all!
Oh also, for what it's worth, I agree with Alex that the name should maybe switch back to the original, as we aren't necessarily introducing new syntax.
I have found the following pages: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html https://bookdown.org/ccolonescu/RPoE4/indvars.html https://www.datacamp.com/community/tutorials/r-formula-tutorial
These indicate that we indeed swapped "variable" and "term" and thus should flip the functions.
@DominiqueMakowski I think we have addressed most issues from the 2nd review round now. Still missing is the example.
> I think what is lacking here is a small "Example" section on the README. We could demonstrate some of the find_* functions (as the get_* ones have a long output not really appropriate for a README) for a model such as the one above (with the iris dataset)? I can give it a try.
Do you think you'll find some time for this in the next few days, or should I start working on this?
As you wish; if you already have an idea of what to do, please don't wait for me. I'll check and expand if need be.
Hi guys - apologies for my silence lately on this. Teaching for the summer is nearly finished! I can give the example a try in about 3 hours when I finish teaching. I should have something shortly, if you want to wait. Obviously, if you'd like to move faster (< 3 hr.), please feel free to do so.
Ok, I'll wait for your input, I'm working on another issue anyway (the nobs() thing Alex mentioned).
@pdwaggoner @DominiqueMakowski
There's one comment:
> I appreciated the new definitions of the various components of a model. Instead of presenting this material all at once, I recommend you group the definitions by components that are easy to confuse and set these groups visually apart.
I'm not 100% sure what is actually meant; do you have an idea?
Maaaaybe something like (strong speculation here): x + poly(x, 2) has only the variable x. x + poly(x, 2) has one variable, x, but two terms, x and poly(x, 2).
(btw, I spotted a typo: Predictors are "unqiue")
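Base R itself makes the same distinction, which we could cite in support (using only stats functions):

```r
# terms() treats poly(x, 2) as a single term, while all.vars()
# returns only the underlying variables.
f <- y ~ x + poly(x, 2)
attr(terms(f), "term.labels")
#> [1] "x"          "poly(x, 2)"
all.vars(f)
#> [1] "y" "x"
```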
Ok, what do you think about this, covering most of the find_*() functions? If you like it, I am happy to include a new section in the README with this. Let me know:
> library(insight)
> # Fit the model
> model <- lm(Sepal.Length ~ I(Sepal.Width ^ 2),
+ data = iris)
> # find_algorithm: Find sampling algorithm and optimizers
> find_algorithm(model)
$algorithm
[1] "OLS"
> # find_formula: Find model formula
> find_formula(model)
$conditional
Sepal.Length ~ I(Sepal.Width^2)
> # find_interactions: Find interaction terms from models
> find_interactions(model)
NULL
> ## Fit INXN model
> model_inxn <- lm(Sepal.Length ~ Petal.Length * Sepal.Width,
+ data = iris)
> find_interactions(model_inxn)
$conditional
[1] "Petal.Length:Sepal.Width"
> # find_parameters: Find names of model parameters
> find_parameters(model)
$conditional
[1] "(Intercept)" "I(Sepal.Width^2)"
> find_parameters(model_inxn)
$conditional
[1] "(Intercept)" "Petal.Length" "Sepal.Width" "Petal.Length:Sepal.Width"
> # find_predictors: Find names of model predictors
> find_predictors(model)
$conditional
[1] "Sepal.Width"
> find_predictors(model_inxn)
$conditional
[1] "Petal.Length" "Sepal.Width"
> # find_response: Find name of the response variable
> find_response(model)
[1] "Sepal.Length"
> # find_terms: Find all model terms
> find_terms(model)
$response
[1] "Sepal.Length"
$conditional
[1] "I(Sepal.Width^2)"
> find_terms(model_inxn)
$response
[1] "Sepal.Length"
$conditional
[1] "Petal.Length" "Sepal.Width"
> # find_variables: Find names of all variables
> find_variables(model)
$response
[1] "Sepal.Length"
$conditional
[1] "Sepal.Width"
Also, I just merged a PR fixing that typo, @DominiqueMakowski. Good catch!
Another (complementary) idea to demonstrate the usefulness of the package (rather than demonstrating the functions) would be to underline the "unified interface" aspect.
What does it allow us to do? For instance, let's say we have a lot of models of different nature:
model1 <- lm(...)
model2 <- lmer(...)
model3 <- stan_glm(...)
model4 <- glmmTMB(...)
model5 <- gamm4(...)
With insight we can easily loop through them and apply the same insight functions; it works in a consistent, neat fashion.
for (model in list(model1, model2, ...)) {
  insight::find_parameters(model)
}
I would put this into a specific context, e.g. credible intervals for random effects of the zero-inflated component:
m <- download_model("brms_zi_3")
sapply(get_parameters(m, effects = "random", component = "zi"), quantile, probs = c(.025, .975))
Or only for random effects of the conditional component, so we can apply it to rstanarm and brms models:
sapply(get_parameters(m, effects = "random", component = "conditional"), quantile, probs = c(.025, .975))
or something like this, showing how to use it for predict():
m <- lm(Sepal.Length ~ Species + Petal.Width + Sepal.Width, data = iris)
dat <- get_data(m)
pred <- find_predictors(m, flatten = TRUE)
l <- lapply(pred, function(x) {
  if (is.numeric(dat[[x]]))
    mean(dat[[x]])
  else
    unique(dat[[x]])
})
names(l) <- pred
cbind(l, predictions = predict(m, newdata = as.data.frame(l)))
And regarding the second remaining issue, I would slightly modify your suggestion: x + poly(x, 2) has only the variable x. x + poly(x, 2) has one variable, x, but two terms, x and poly(x, 2).
Maybe, following the examples, we can mix both my suggestion and Dominique's suggestion of how to demonstrate the usefulness? And I would put these examples before the examples from other packages in the paper, and probably somewhere at the end of the readme?
Great suggestions - I was just working on something similar, as you wouldn't be able to loop over model objects, thereby necessitating use of the *apply family. (And I agree on placement in both the paper and the README; I am happy to make those changes once you guys OK the example.) Here is my idea. What do you think?
# model 1 - lm
model1 <- lm(Sepal.Length ~ I(Sepal.Width ^ 2), data = iris)
# model 2 - glm
model2 <- glm(am ~ mpg + hp, family = binomial, data = mtcars)
# model 3 - gamm4
library(gamm4)
set.seed(0)
dat <- gamSim(1, n = 400, scale = 2)
dat$fac <- fac <- as.factor(sample(1:20, 400, replace = TRUE))
dat$y <- dat$y + model.matrix(~fac - 1) %*% rnorm(20) * 0.5
model3 <- gamm4(y ~ s(x0) + x1 + s(x2), data = dat, random = ~(1 | fac))
models <- paste0("model", 1:3)
sapply(models, function(x) insight::find_parameters(get(x)))
What do you think? https://easystats.github.io/insight/articles/insight.html
See also readme: https://github.com/easystats/insight/
Next, I will add the examples.
These are really great changes! For the README examples, are you thinking more technical (e.g., using sapply)? Or are you thinking to essentially replicate the pkgdown examples? And should we then use these same examples again in the paper? Or leave the current paper examples as they are?
> For the README examples, are you thinking more technical (e.g., using sapply)?
The examples we already have in the paper are great for demonstrating how easy it is to access model information.
I think what we should add, and probably what Alex meant, are examples that show which real-world problems can be solved with insight. Thus my idea to describe in which packages it is already used; but he said that this would mean users must go look up the code in those repositories.
So I think we need examples that do something small, like "predictions" or "credible intervals" etc. Something like a "higher-order problem" that can be solved using insight. This I would add to the paper (before the package examples) and the readme. And I would preserve your examples.
That seems fair. Although the main benefit of using insight is its "universality". Hence, we could also showcase how one can benefit from insight with a "dumb" example like the following:
Let's create a function that nicely prints all the parameters of a model:
print_params <- function(model){
  paste0("My parameters are ",
         paste0(row.names(summary(model)$coefficients), collapse = ", "),
         ", thank you for your attention!")
}
Let's test it:
print_params(lm(Sepal.Length ~ Petal.Width, data = iris))
[1] "My parameters are (Intercept), Petal.Width, thank you for your attention!"
Great, it works! Let's do this with a GAM:
print_params(mgcv::gam(Sepal.Length ~ Petal.Width + s(Petal.Length), data = iris))
[1] "My parameters are , thank you for your attention!"
Oh no, it doesn't work for this class of models.
As access to model information depends on the type of the model in the R ecosystem, we would need to create specific methods for all model types.
"It's a bird! It's a plane! It's insight!"
With insight, you can write a function without having to worry about the model type:
print_params <- function(model){
  paste0("My parameters are ",
         paste0(insight::find_parameters(model, flatten = TRUE), collapse = ", "),
         ", thank you for your attention!")
}
print_params(lm(Sepal.Length ~ Petal.Width, data = iris))
[1] "My parameters are (Intercept), Petal.Width, thank you for your attention!"
print_params(mgcv::gam(Sepal.Length ~ Petal.Width + s(Petal.Length), data = iris))
[1] "My parameters are (Intercept), Petal.Width, s(Petal.Length), thank you for your attention!"
Yeah, that could be a good "summary" example to point out its universality.
Love this idea @DominiqueMakowski. I was missing the "real world" application here and was instead just demonstrating the use of the package. But this is a great approach to showing the functionality of the package. Love it.
Great example! Took two reads (running on low sleep) but clearly conveys the purpose and utility of insight.
> Took two reads (running on low sleep)
That's a fair statistic, and a convincing argument for any text, actually - "if it takes no more than two reads before you've had a coffee, people understand what you've written" :-D
Revised the paper; I'll respond to Alex's comments now.
Closing, Alex has recommended accepting the paper.
Super exciting! Great work guys.
What about submitting this to JOSS, or rOpenSci + JOSS?