annagraff / densify

GNU Affero General Public License v3.0

[JOSS REVIEW] Code review #12

Closed yjunechoe closed 2 months ago

yjunechoe commented 3 months ago

To round off my initial review, below are my (very opinionated) thoughts and comments on the code. These are merely suggestions, and I do not need to see them resolved for the rest of the review process to proceed. Please feel free to pick and choose whatever feels appropriate and ignore the rest. Happy to discuss specific points further, if any are of interest.

Logging

1. The package feels somewhat too loud: there are a lot of warnings and messages being thrown which I fear may be more confusing than helpful for new users (and annoying for experienced ones). For example, the "Use X function" hints are currently attached to the print method of the "densify_result" object, which clutters the console every time the user prints the object for inspection. A less verbose pattern would be to move the messages into the densify() function itself, and/or use cli::cli_inform(.frequency = ) to reduce how often such hints are printed (a sketch of this follows below).
2. The semantics ascribed to warnings and errors in the package feel a bit out of line with how they are typically used in R. Warnings should be reserved for signaling problems in functionality, like unexpected inputs, off-label usage, or deprecated behavior, but the package uses warnings more loosely (e.g., to tell users that some functions drop attributes; if that is simply the intended behavior of the function, it should at most be a message, and if it should be discouraged entirely, consider throwing an error instead). Similarly for errors: there are a couple of places where abort(.internal = TRUE) is used inappropriately. These "internal errors" carry higher severity than regular errors: they signal something wrong on the developer's end and are typically a cue for users to open an issue. I don't think .internal = TRUE makes sense where it's currently being used (e.g., when a user specifies something other than one of the three possible options for the density_mean argument; it's on the user to read the docs on what's supported).
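A minimal sketch of the less-verbose pattern suggested in point 1, assuming the hint is emitted from a helper called by densify() rather than the print method; the helper name, message text, and .frequency_id are illustrative, not the package's actual code:

```r
# Throttle the "what to do next" hint to once per R session instead of
# printing it every time the result object is shown (wording illustrative).
emit_densify_hint <- function() {
  cli::cli_inform(
    "Use {.fn visualize} to inspect scores, or {.fn tibble::as_tibble} to work with the result as a plain tibble.",
    .frequency = "once",
    .frequency_id = "densify_result_hint"
  )
}
```

The extra arguments are forwarded by cli::cli_inform() to rlang::inform(), which implements the once-per-session throttling.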

The densify() function

1. Currently, limits and density_mean_weights accept a named list of arguments, with the available parameters specified in the documentation. This is just a suggestion, but I'd also like to raise the possibility of a more modular pattern: option/control objects that act as light wrappers around list() and only allow a specific set of names. For example, densify(..., limits = list(min_coding_density = 1)) could become something like densify(..., limits = opt_limits(min_coding_density = 1)), where opt_limits() validates its input and hosts the documentation about the available options for setting limits (see the sketch after this list).
2. I would also like to see a bit more in the examples of the densify() docs. Minimally, some code that can be run to inspect the expected structure of the input data and of the output data frame.
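A rough sketch of the option-object pattern from point 1; opt_limits() and every parameter name other than min_coding_density (which comes from the example above) are hypothetical, not part of the current API:

```r
# A light wrapper around list() that only admits a fixed set of limit
# settings and validates them up front (parameter names are illustrative).
opt_limits <- function(min_coding_density = NULL, min_taxa = NULL) {
  out <- list(min_coding_density = min_coding_density, min_taxa = min_taxa)
  out <- out[!vapply(out, is.null, logical(1))]        # keep only what was set
  stopifnot(all(vapply(out, is.numeric, logical(1))))  # basic input check
  structure(out, class = "densify_limits")
}

# Hypothetical usage: densify(..., limits = opt_limits(min_coding_density = 1))
```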

The densify_result object and its methods

1. I really like the design of using the vocabulary from {generics}, and the workflow as summarized in ?densify and ?densify_result. One additional thing that would help onboard new users is a simple diagram/graphic of this workflow in the readme, similar in style to the data science workflow graphic from R4DS.
2. In so far as the package has a more OOP flavor, I think it could benefit from some getter functions on the densify_result object for user convenience (e.g., get_data(), get_scores()). For example, visualize() currently prints something like "use tibble::as_tibble(X$data[[123]]) to obtain the pruned data frame". This is just my personal preference, but IMO it would be a bit more palatable if this code could instead read "use get_data(X, i = 123) to ...", to hide the internal object structure from users. These getters could be documented alongside ?densify_result (see the sketch after this list).
3. Related to the above, it would be helpful if the package exported an is_densify_result() function to distinguish densify_result objects from plain data frames/tibbles.
4. Since the visualize() method returns a ggplot, it may also be worth extending ggplot2::autoplot() alongside plot(). On this note, I think ggplot2 should simply be declared in Imports, unless there is a strong reason for keeping dependencies low. If ggplot2 should stay in Suggests, I'd still recommend checking that it is installed (and at a suitable version) before loading it (see rlang::check_installed() for a friendly interface to this).
5. If possible, the contents of pruning_state_stat_vars should be exported in some form, to let users see which variables are available for use inside scoring_function without having to hit the error in eval_densify_results_quality_score() first.
6. Somewhat related to the above point, I'm curious about the choice of having the scoring_function argument take an expression rather than a function (e.g., n_data_points * coding_density vs. \(n_data_points, coding_density) n_data_points * coding_density). I wonder if the argument could be renamed to something more generic like scoring_criterion, if that pivot isn't too much work.
7. I see that select() and rename() are extended only to throw a warning that they produce undesirable behavior. I'm a bit confused whether the intent is: (a) you can use dplyr functions on densify_result directly, but take special care with select()/rename(), or (b) please do not use dplyr functions on densify_result directly, especially select()/rename(), which will produce undesirable consequences. As written, warning only for select()/rename() leaves it ambiguous whether using other dplyr functions (e.g., mutate()) directly on densify_result is encouraged/supported at all. I think the stance could be clarified in the vignette and/or the message.
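As a concrete (and purely hypothetical) illustration of points 2 and 3, the getters could look roughly like the sketch below; the class name and the retrieval step mirror the hint that visualize() currently prints and are assumptions rather than the package's internals:

```r
# Hypothetical convenience helpers for a densify_result `x`; the class name
# and the `data` column access are inferred from the printed hints above.
is_densify_result <- function(x) {
  inherits(x, "densify_result")
}

get_data <- function(x, i) {
  stopifnot(is_densify_result(x))
  # Mirrors the currently suggested `tibble::as_tibble(X$data[[i]])` call,
  # but hides the internal column structure behind a named function.
  tibble::as_tibble(x$data[[i]])
}
```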

tzakharko commented 3 months ago

@yjunechoe Thank you for the in-depth review and the insightful comments! I would like to briefly present some of our design considerations when building the API, as it might help understand some of the choices. Afterward, we can discuss how the interface can be improved.

Our main design goals were user-friendliness, explicit contracts, and a small API footprint. We assume that densify is not a package that will be used every day; rather, it is a specialized tool for solving certain types of problems. To increase familiarity, we borrow heavily from tidyverse conventions and from the existing modeling functions in generics. For example, densify_result is just a tibble with some extra behavior on top of it, but this behavior is only guaranteed if certain invariants are met (e.g., it will break if some of the columns are removed or modified). This is also why the package is designed to be rather "chatty" and interactive, making the user aware of the additional functionality available to them. Most of the design is an attempt to "safely" work around some of the limitations of R's type system and of lazily computed values. An expert user can discard this safety net by casting the result to a tibble and working with it directly.

Regarding the individual points you raise

1.1. These additional messages serve to remind the user that densify_result is more than just a tibble and that additional functionality is available to them. The sentiment that the messages might be too much is understandable; we should look into making them more compact. At the same time, I would prefer not to remove them altogether, as that would make densify_result less distinguishable from a tibble. An expert user can remove all of this by casting to a tibble, which is the encouraged way (we should make this clearer in the documentation).
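As a one-line illustration of the escape hatch described here (the variable name is illustrative):

```r
# Casting to a plain tibble opts out of the densify_result behaviour and
# its print-time hints, as described above.
plain <- tibble::as_tibble(result)  # `result` is a densify_result object
```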

1.2 This is a very good point; we will look into it and rework the signals. Thank you for bringing this to our attention!

2.1 and 2.2 Both are great points; we will look into them.

3.1 Great suggestion! @annagraff I really like the diagram idea, what do you think?

3.2 Yeah, we spent some time going back and forth on this :) I feel that introducing new functions here would go against the goal of keeping the API surface small. I would also really like to minimize the OOP feel of the interface (although some of it is inescapable). Since densify_result is a tibble, we want to encourage the user to work with that tibble directly. There is an additional technical complication in that the pruned data frames are lazy values; we have to do it this way since eagerly computing all of these tables would incur tremendous memory and processing-time costs. Unfortunately, R does not currently support storing lazy values in anything that is not an environment, hence the somewhat verbose API. An alternative would be to forego the data frame structure underpinning densify_result and make it a list-like object instead, but that leads to other issues. To summarize, I don't think there is currently a good solution. It would be great if a suitable generic existed either in base or in generics; I petitioned the maintainers of the DBI package to make fetch available as a more general generic name, without much success.
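A small base-R illustration of the constraint described above (this is not densify's implementation): a lazily computed value lives as a promise, and promises can only be stored inside an environment, which in turn can be placed in a list or tibble column.

```r
# Not package code -- just a demonstration that a lazy value needs an
# environment wrapper before it can sit inside a column.
slot <- new.env()
delayedAssign(
  "pruned",
  {
    message("computing the pruned frame only now")
    head(mtcars)  # stand-in for an expensive pruning step
  },
  assign.env = slot
)

data_column <- list(slot)   # nothing has been computed yet
data_column[[1]]$pruned     # first access forces the computation
```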

3.3 This makes sense

3.4 Both great points. Looking at this again, I am actually surprised that we don't have a hard dependency on ggplot2.
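If ggplot2 were to stay in Suggests, a minimal sketch of the guard the review mentions (the wrapper name and reason text are illustrative):

```r
# Check for an optional ggplot2 dependency before using it;
# rlang::check_installed() offers to install the package interactively.
visualize_guarded <- function(x, ...) {
  rlang::check_installed("ggplot2", reason = "to plot densify results")
  # ... the existing visualize() plotting code would follow here ...
}
```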

3.5 That is documented in the relevant functions and in ?densify_result. Do you consider this documentation insufficient?

3.6 We thought that this interface would feel most natural to users familiar with the tidyverse. Other options would be a formula or a function. I agree with renaming scoring_function to scoring_criterion, or maybe even just score; we will look into it. We could also add an alternative parameter that accepts a function (that is a bit trickier if one wants a safe interface).
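For reference, a hedged sketch of how an expression-style criterion can be captured and evaluated against a fixed set of statistics, which is roughly the tidyverse-style interface discussed here; the helper and the statistic values are illustrative, and only the variable names come from the example above:

```r
# Illustrative only: capture a tidyverse-style scoring expression and
# evaluate it against a named list of per-step statistics.
library(rlang)

eval_score <- function(criterion, stats) {
  eval_tidy(enquo(criterion), data = stats)
}

step_stats <- list(n_data_points = 120, coding_density = 0.8)
eval_score(n_data_points * coding_density, step_stats)
#> [1] 96
```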

3.7 The idea is to inform the user that once they start manipulating this data with dplyr verbs such as select()/rename(), they opt out of the densify_result contract and the result is just a regular tibble. So it's more along the lines of "you can do whatever you want, but be aware that you are now just working with a tibble". We will look into making these messages clearer. Maybe a good approach would be to disallow these verbs and ask the user to downcast the data to a tibble first. All of this is mostly cosmetic, of course; it is again part of the safety net to help the user avoid some potential pitfalls.

yjunechoe commented 3 months ago

Thanks for the detailed responses! Most of my points came from a place of not being familiar with the underlying design goals, so this clears up a lot of things for me.

Namely, I'm totally convinced by (1.1), (3.2), (3.7) - it looks like I mischaracterized the densify_result object as leaning into OOP. If treating it as just a data frame for general data-wrangling purposes is allowed/supported, the light infrastructure around the object makes good sense to me, and the current interface feels very appropriate for this. This is probably just my prior experience biting me, but I'm always suspicious of custom objects that look like a data frame, so I need very clear instructions as to whether I should or should not treat the object as a data frame. But I agree that most users would happily take the hints/instructions that are currently being printed out and just do whatever they're familiar with.

On (3.5) - I think the current documentation is great, so it's probably not worth doing (3.5) just because of my selfishness. It's just that, on principle, I prefer being able to, for example, inspect all available p-value adjustment methods by printing the vector stats::p.adjust.methods, without needing to open the docs for ?stats::p.adjust. I'm likely in the minority in caring so much about this detail, so please feel free to ignore this very minor point! :)

I also see the point about tidyverse-style for (3.6) - expressions are probably more natural than functions, so I think it's designed well for that audience. I don't think a new argument accepting a function is necessary.


I have just one follow-up on verbosity (1.1), specifically about one of the lines that get printed as part of densify_result:

#> # Use `eval()` to manually retrieve any data frame from column `data`

This one feels a bit too foreign/technical (consider that ?eval isn't informative here) and was one of the reasons why I suggested a getter (3.2). But maybe it suffices to make the message a little bit more informative to avoid any confusion. Just something to think about.

tzakharko commented 3 months ago

Thanks @yjunechoe!

Regarding (3.5) — the available variables are simply the columns of the densify_result table. Would it address the issue if we clarified this better in the documentation?

I agree with your comment regarding eval(). While the lazy pruned values are unevaluated R expressions, you are right that mentioning eval() here is too technical. I think the as_tibble() or as.data.frame() wrapper should be sufficient. We can mention eval() in the more detailed documentation.

yjunechoe commented 3 months ago

> Regarding (3.5) — the available variables are simply the columns of the densify_result table. Would it address the issue if we clarified this better in the documentation?

The documentation is clear! I was just wondering whether the internal pruning_state_stat_vars variable could be exported (perhaps under a more relevant name, like score_vars), just so that information about the available variables is also accessible from within R code (vs. reading the docs). Sorry, this wasn't meant to be a point about clarity - I was just curious about the choice to have the variable but not export it (I realize this is just a difference in design preference - sorry for the confusion!).