Addressing R CMD check notes due to tidyeval syntax

gvegayon commented 1 month ago

~Adds + file LICENSE to DESCRIPTION~ Adds LICENSE to the .Rbuildignore file (R convention).
Moves withr to Suggests, as it is only used during testing.
Adds the rlang::.data imports to address masking notes.
Makes use of .data$ for dplyr verbs when needed (see here).
Replaces expressions with strings in dplyr verbs when needed.
~Flags various variables to avoid warnings during R CMD check via utils::globalVariables. All variables in the list are those used in tidybayes::spread_draws calls (which uses expressions).~
Adds some files/folders to .Rbuildignore.
Modifies the NEWS.md document to follow R's convention (see also #118).
Ensures all ggplot2 functions are imported and thus removes the use of ggplot2:: (NOTE: dplyr:: is still heavily used).

gvegayon commented 1 month ago

Missing piece: How to deal with the warning about the license.

gvegayon commented 1 month ago

One how did you do this all in a day. You rock.

Thank you!

Two I will admit I hate this so much... I know the other option is .data[variable] and that's not much better...

@seabbs suggested that there is a way to use autoglobals() and we could also import the common functions to avoid having to use things like ggplot2::ggplot() and such everywhere (though I know there is currently lots of redundancy in this). I just would prefer the code be a bit more readable...

Yes, I don't like it either, but it's the way dplyr says it should be done. Just checked the autoglobals alternative using roxygen tags; looks handy! One question: How reliable is that R package? We wouldn't be adding a dependency for the users, but we would be doing so for the developers. Perhaps avoid the package and use utils::globalVariables()? It would be pretty much the same thing!

Also, calling ggplot2:: like that is a good practice in package development (see here); but there are exceptions.

dylanhmorris commented 1 month ago

Flags various variables to avoid warnings during R CMD check via utils::globalVariables. All variables in the list are those used in tidybayes::spread_draws calls (which uses expressions).

Thanks for this, @gvegayon!

For tidybayes::spread_draws, we can do calls of the form

fit |> tidybayes::spread_draws(!!str2lang("variable name"))

See https://github.com/mjskay/tidybayes/issues/38#issuecomment-895281257

seabbs commented 1 month ago

This is a lot of work so well done @gvegayon but as discussed here I find this code nearly unreadable and the benefits seem very small.

I think if we used the package namespace and imported our core tooling (i.e from dplyr, and ggplot2) this would help a lot with the package::function bloat. I see few if any downsides to doing that and it fits within standard practice.

The downside to not using .data$var is that as @gvegayon points out it is recommended practice and not doing so could add some small danger. However I think the risk from this is fairly marginal in most of this code base so would prefer we use globals (but I feel less strongly about this than the package::function whose removal would make the code much easier to read and so make this issue less).

In terms of just manually assigning the globals. Yes, we could but it is exceptionally tedious. As it's a light dev dependency where any change will run through CI I don't think having it as a dev dependency is a serious issue. @athowes how have you found it?

short cuts on optimal development

The autoglobal is maybe a small short cut but I don't think using the NAMESPACE is.

I think the advice that @gvegayon posted could do with another read. In particular these caveats:

An operator: You can’t call an operator from another package via ::, so you must import it. Examples: the null-coalescing operator %||% from rlang or the original pipe %>% from magrittr.

A function that you use a lot: If importing a function makes your code much more readable, that’s a good enough reason to import it. This literally reduces the number of characters required to call the external function. This can be especially handy when generating user-facing messages, because it makes it more likely that lines in the source correspond to lines in the output.

A function that you call in a tight loop: There is a minor performance penalty associated with ::. It’s on the order of 100ns, so it will only matter if you call the function millions of times.

where the middle point is relevant to us with things like select, mutate, ggplot, aes etc etc./

athowes commented 1 month ago

I don't know that I understand the full context here but I've found using the roxyglobals easy enough.

About the other parts, it's not really my place to say / it's up to what you collectively decide, but if I were developing this package I would probably have reservations about the pkg:: to everything and .data$var points. Perhaps that's just selfishness (it makes development less nice) but I think there are reasonable arguments to be made about code readability reducing the amount of error. With something like putting ggplot2:: I might also question the extent to which this is ever likely to create bugs that aren't obvious / very easy to fix (I think that with some other more internal functions you might expect that an undetected namespace clash has more of a risk of creating a hard to see bug).

kaitejohnson commented 1 month ago

@gvegayon I'm inclined to give the roxyglobals a try (happy to do this in a separate PR and then compare the pros and cons). I think I saw that you had already started adding @autoglobal to some of the new functions you wrote?

kaitejohnson commented 1 month ago

Ok after discussing face-to-face, it seems like this is going to have to be the way to go in terms of the .data[[] and string calls, so will defer to @dylanhmorris on the requested changes.

One remaining question I have is whether we could still do some amount of importing common package functions like ggplot, aes, geom_line, etc. or does that also always cause errors? It doesn't matter too much but I think it might help slightly with readability and is definitely hard to get into the habit of doing for things like ggplot

dylanhmorris commented 1 month ago

Yes, important to separate the question of importing any/many functions or other names from other packages from the question of how to handle column names in tidyeval functions. We can continue to import things like ggplot2::ggplot etc (and note that per my review this PR sticks with importing ggplot2::aes)

dylanhmorris commented 1 month ago

.data[[] and string calls

Clarification that there are some places we'd be able to get away with tidyeval syntax, e.g. this could be just lab_site_index: https://github.com/CDCgov/ww-inference-model/blob/7bcbebc4965b52759f9ba7b80c71f67f5b6cbc7b/R/preprocessing.R#L214

But in other places, we'll definitely need the .data pronoun, e.g. when column names are defined in a variable:

https://github.com/CDCgov/ww-inference-model/blob/7bcbebc4965b52759f9ba7b80c71f67f5b6cbc7b/R/preprocessing.R#L299

But given that:

we'll be using .data of necessity in some places
it's the tidyverse recommendation for packaged code as @gvegayon notes

I favor just using it throughout to be consistent

dylanhmorris commented 1 month ago

In short: I support removing the namespace qualifications from this PR (e.g. revert ggplot2::ggplot() to ggplot()), but keeping the universal use of the .data pronoun. @kaitejohnson, @gvegayon et al does that sound reasonable?

kaitejohnson commented 1 month ago

In short: I support removing the namespace qualifications from this PR (e.g. revert ggplot2::ggplot() to ggplot()), but keeping the universal use of the .data pronoun. @kaitejohnson, @gvegayon et al does that sound reasonable?

Yes I agree. I would also support importing a few other ggplot functions like geom_line(), geom_point(), geom_vline/hline() etc. but again this isn't super critical because I don't intend this package will be doing a ton of plotting anyway.

gvegayon commented 1 month ago

Thanks all for the comments. I will add ggplot2's function to imports and keep the .data syntax.

seabbs commented 1 month ago

.data[[] and string calls

It would be best if having f2f calls to put the resulting conversation/justifications explicitly in the PR notes or it can make decision-making hard to track and seemingly arbitrary.

I assume there you mean their recommendation to use .data$var and not .data[[var]] which they don't make a strong case for that I can see. From the discussion that is here it seems a bit that those things might be conflated (i.e if we are really following dplyr standards we should be supporting NSE as the tool for specifying input in our functions which it doesn't seem like we are).

This general discussion seems like it could be usefully promoted to a wider CFA standards discussion.

I also think this PR really shows the value of more robust issue discussion prior to writing code as this will end up costing @gvegayon quite a lot of time.

kaitejohnson commented 1 month ago

Here's how I would summarize the discussion:

tidyeval/dplyr error prone because of ambiguity with column names versus variables being passed to functions.
R packages can look ugly/difficult to read under the hood. I personally am strongly in favor of readability, especially at this early stage, for those that would want to use this as I think it encourages contributions and makes it more approachable for someone to understand what a function is doing.
Using .data[[var]] whenever a column is being assigned, and using .data$var when something is being referenced, and using .data[["var_name"]] whenever calling on a specific column name, and the !! whenever something is being assigned that is a variable not a column name.... (@gvegayon @dylanhmorris please correct me if I'm wrong/perhaps in this PR we could just clarify that so we know going forward what the standards are)

My takeaway:

package should have never been built using dplyr from the start given this. However, this would have made development a bit slower (for me personally) since despite being a modeling package, there is still a fair bit of data wrangling that needs to happen to format things properly with the two datasets

seabbs commented 1 month ago

error prone because of ambiguity with column names versus variables being passed to functions.

Is there evidence for this assertion?

R packages can look ugly/difficult to read under the hood

They can also not right.

package should have never been built using dplyr f

Interestingly we will be switching epidist over from data.table to dplyr as it makes it much easier to get contributions.

Using

Sounds like a clear plan. Should add this steer to the contributing guide. I would prefer if we are going with dplyr guidance we fully go with dplyr guidance (i.e !! etc)/

gvegayon commented 1 month ago

error prone because of ambiguity with column names versus variables being passed to functions.

Is there evidence for this assertion?

The first thing I can think of is this issue with masking/unmasking caught by CFA in the Rt pipeline: Teams discussion. I was able to reproduce the bug with had to do with NSE here

dylanhmorris commented 1 month ago

Using .data[[var]] whenever a column is being assigned, and using .data$var when something is being referenced, and using .data[["var_name"]] whenever calling on a specific column name, and the !! whenever something is being assigned that is a variable not a column name.... (@gvegayon @dylanhmorris please correct me if I'm wrong/perhaps in this PR we could just clarify that so we know going forward what the standards are)

This is close but not exact. As @seabbs says it may be simplest to follow the conventions in the dplyr vignette, particularly https://dplyr.tidyverse.org/articles/in-packages.html#data-masking-and-tidy-selection-notes

There are a few places the dplyr guidelines are silent, however, and there we can and should just choose a convention and be consistent. I will try to write up a summary.

kaitejohnson commented 1 month ago

Interestingly we will be switching epidist over from data.table to dplyr as it makes it much easier to get contributions.

I think that makes sense. Curious if you all use autoglobals() or follow the dplyr guidelines?

This also points out that we need to write up contributing guidelines...

seabbs commented 1 month ago

Curious if you all use autoglobals() or follow the dplyr guidelines?

Autoglobals for the left hand side and nse for the right as needed I think but we will see how it goes obviously

kaitejohnson commented 1 month ago

Btw since I presume this will be merged pretty close to as is and I expect it will lead to many git conflicts if I don't branch off if it, I am going to work from this PR. I would approve and merge but I see a number of unresolved conversations and I don't want to commit those suggestion myself. Looks to me like they are mostly choice of syntax.

CDCgov / ww-inference-model

Addressing R CMD check notes due to tidyeval syntax #108