Closed gvegayon closed 1 month ago
Missing piece: How to deal with the warning about the license.
One how did you do this all in a day. You rock.
Thank you!
Two I will admit I hate this so much... I know the other option is
.data[variable]
and that's not much better...@seabbs suggested that there is a way to use
autoglobals()
and we could also import the common functions to avoid having to use things likeggplot2::ggplot()
and such everywhere (though I know there is currently lots of redundancy in this). I just would prefer the code be a bit more readable...
Yes, I don't like it either, but it's the way dplyr
says it should be done. Just checked the autoglobals alternative using roxygen tags; looks handy! One question: How reliable is that R package? We wouldn't be adding a dependency for the users, but we would be doing so for the developers. Perhaps avoid the package and use utils::globalVariables()
? It would be pretty much the same thing!
Also, calling ggplot2::
like that is a good practice in package development (see here); but there are exceptions.
- Flags various variables to avoid warnings during R CMD check via
utils::globalVariables
. All variables in the list are those used intidybayes::spread_draws
calls (which uses expressions).
Thanks for this, @gvegayon!
For tidybayes::spread_draws
, we can do calls of the form
fit |> tidybayes::spread_draws(!!str2lang("variable name"))
See https://github.com/mjskay/tidybayes/issues/38#issuecomment-895281257
This is a lot of work so well done @gvegayon but as discussed here I find this code nearly unreadable and the benefits seem very small.
I think if we used the package namespace and imported our core tooling (i.e from dplyr, and ggplot2) this would help a lot with the package::function
bloat. I see few if any downsides to doing that and it fits within standard practice.
The downside to not using .data$var
is that as @gvegayon points out it is recommended practice and not doing so could add some small danger. However I think the risk from this is fairly marginal in most of this code base so would prefer we use globals (but I feel less strongly about this than the package::function whose removal would make the code much easier to read and so make this issue less).
In terms of just manually assigning the globals. Yes, we could but it is exceptionally tedious. As it's a light dev dependency where any change will run through CI I don't think having it as a dev dependency is a serious issue. @athowes how have you found it?
short cuts on optimal development
The autoglobal is maybe a small short cut but I don't think using the NAMESPACE is.
I think the advice that @gvegayon posted could do with another read. In particular these caveats:
An operator: You can’t call an operator from another package via ::, so you must import it. Examples: the null-coalescing operator %||% from rlang or the original pipe %>% from magrittr.
A function that you use a lot: If importing a function makes your code much more readable, that’s a good enough reason to import it. This literally reduces the number of characters required to call the external function. This can be especially handy when generating user-facing messages, because it makes it more likely that lines in the source correspond to lines in the output.
A function that you call in a tight loop: There is a minor performance penalty associated with ::. It’s on the order of 100ns, so it will only matter if you call the function millions of times.
where the middle point is relevant to us with things like select, mutate, ggplot, aes etc etc./
I don't know that I understand the full context here but I've found using the roxyglobals
easy enough.
About the other parts, it's not really my place to say / it's up to what you collectively decide, but if I were developing this package I would probably have reservations about the pkg::
to everything and .data$var
points. Perhaps that's just selfishness (it makes development less nice) but I think there are reasonable arguments to be made about code readability reducing the amount of error. With something like putting ggplot2::
I might also question the extent to which this is ever likely to create bugs that aren't obvious / very easy to fix (I think that with some other more internal functions you might expect that an undetected namespace clash has more of a risk of creating a hard to see bug).
@gvegayon I'm inclined to give the roxyglobals
a try (happy to do this in a separate PR and then compare the pros and cons). I think I saw that you had already started adding @autoglobal
to some of the new functions you wrote?
Ok after discussing face-to-face, it seems like this is going to have to be the way to go in terms of the .data[[]
and string calls, so will defer to @dylanhmorris on the requested changes.
One remaining question I have is whether we could still do some amount of importing common package functions like ggplot
, aes
, geom_line
, etc. or does that also always cause errors? It doesn't matter too much but I think it might help slightly with readability and is definitely hard to get into the habit of doing for things like ggplot
Yes, important to separate the question of importing any/many functions or other names from other packages from the question of how to handle column names in tidyeval functions. We can continue to import things like ggplot2::ggplot
etc (and note that per my review this PR sticks with importing ggplot2::aes
)
.data[[] and string calls
Clarification that there are some places we'd be able to get away with tidyeval syntax, e.g. this could be just lab_site_index
:
https://github.com/CDCgov/ww-inference-model/blob/7bcbebc4965b52759f9ba7b80c71f67f5b6cbc7b/R/preprocessing.R#L214
But in other places, we'll definitely need the .data
pronoun, e.g. when column names are defined in a variable:
But given that:
.data
of necessity in some placesI favor just using it throughout to be consistent
In short: I support removing the namespace qualifications from this PR (e.g. revert ggplot2::ggplot()
to ggplot()
), but keeping the universal use of the .data
pronoun. @kaitejohnson, @gvegayon et al does that sound reasonable?
In short: I support removing the namespace qualifications from this PR (e.g. revert
ggplot2::ggplot()
toggplot()
), but keeping the universal use of the.data
pronoun. @kaitejohnson, @gvegayon et al does that sound reasonable?
Yes I agree. I would also support importing a few other ggplot functions like geom_line()
, geom_point()
, geom_vline/hline()
etc. but again this isn't super critical because I don't intend this package will be doing a ton of plotting anyway.
Thanks all for the comments. I will add ggplot2
's function to imports and keep the .data
syntax.
.data[[] and string calls
It would be best if having f2f calls to put the resulting conversation/justifications explicitly in the PR notes or it can make decision-making hard to track and seemingly arbitrary.
I assume there you mean their recommendation to use .data$var
and not .data[[var]]
which they don't make a strong case for that I can see. From the discussion that is here it seems a bit that those things might be conflated (i.e if we are really following dplyr
standards we should be supporting NSE as the tool for specifying input in our functions which it doesn't seem like we are).
This general discussion seems like it could be usefully promoted to a wider CFA standards discussion.
I also think this PR really shows the value of more robust issue discussion prior to writing code as this will end up costing @gvegayon quite a lot of time.
Here's how I would summarize the discussion:
dplyr
error prone because of ambiguity with column names versus variables being passed to functions..data[[var]]
whenever a column is being assigned, and using .data$var
when something is being referenced, and using .data[["var_name"]]
whenever calling on a specific column name, and the !!
whenever something is being assigned that is a variable not a column name.... (@gvegayon @dylanhmorris please correct me if I'm wrong/perhaps in this PR we could just clarify that so we know going forward what the standards are) My takeaway:
dplyr
from the start given this. However, this would have made development a bit slower (for me personally) since despite being a modeling package, there is still a fair bit of data wrangling that needs to happen to format things properly with the two datasetserror prone because of ambiguity with column names versus variables being passed to functions.
Is there evidence for this assertion?
R packages can look ugly/difficult to read under the hood
They can also not right.
package should have never been built using dplyr f
Interestingly we will be switching epidist
over from data.table
to dplyr
as it makes it much easier to get contributions.
Using
Sounds like a clear plan. Should add this steer to the contributing guide. I would prefer if we are going with dplyr guidance we fully go with dplyr guidance (i.e !! etc)/
error prone because of ambiguity with column names versus variables being passed to functions.
Is there evidence for this assertion?
The first thing I can think of is this issue with masking/unmasking caught by CFA in the Rt pipeline: Teams discussion. I was able to reproduce the bug with had to do with NSE here
Using .data[[var]] whenever a column is being assigned, and using .data$var when something is being referenced, and using .data[["var_name"]] whenever calling on a specific column name, and the !! whenever something is being assigned that is a variable not a column name.... (@gvegayon @dylanhmorris please correct me if I'm wrong/perhaps in this PR we could just clarify that so we know going forward what the standards are)
This is close but not exact. As @seabbs says it may be simplest to follow the conventions in the dplyr
vignette, particularly
https://dplyr.tidyverse.org/articles/in-packages.html#data-masking-and-tidy-selection-notes
There are a few places the dplyr
guidelines are silent, however, and there we can and should just choose a convention and be consistent. I will try to write up a summary.
Interestingly we will be switching epidist over from data.table to dplyr as it makes it much easier to get contributions.
I think that makes sense. Curious if you all use autoglobals()
or follow the dplyr
guidelines?
This also points out that we need to write up contributing guidelines...
Curious if you all use autoglobals() or follow the dplyr guidelines?
Autoglobals for the left hand side and nse for the right as needed I think but we will see how it goes obviously
Btw since I presume this will be merged pretty close to as is and I expect it will lead to many git conflicts if I don't branch off if it, I am going to work from this PR. I would approve and merge but I see a number of unresolved conversations and I don't want to commit those suggestion myself. Looks to me like they are mostly choice of syntax.
+ file LICENSE
toDESCRIPTION
~ AddsLICENSE
to the.Rbuildignore
file (R convention).withr
toSuggests
, as it is only used during testing.rlang::.data
imports to address masking notes..data$
fordplyr
verbs when needed (see here).dplyr
verbs when needed.utils::globalVariables
. All variables in the list are those used intidybayes::spread_draws
calls (which uses expressions).~.Rbuildignore
.NEWS.md
document to follow R's convention (see also #118).ggplot2
functions are imported and thus removes the use ofggplot2::
(NOTE:dplyr::
is still heavily used).