reikookamoto opened this issue 1 year ago
The suggestions look great. I suggest addressing anything that looks uncontroversial (reducing redundancy, fixing grammar, clarifying...) and isn't a breaking change. You could PR the changes and get feedback on specifics.
Another way of replying is saying there are quite a few comments and suggestions and so it is difficult to respond to all of them in this thread. Let's have separate issue discussions for a few questions.
That said, here are a few specific responses.
- Adopt conventions when writing variable, function, and package names in addition to file paths
- Clarify terminology
  - What is the difference between a variable name and a label, and how does this relate to other programming languages like Stata?
  - `tester1` and `tester2` should be referred to as datasets rather than databases; the latter usually refers to an organized collection of data
- Clarify the order in which a new user should read the vignettes
- Proofread for grammatical and spelling errors
- Reduce redundancy in writing (i.e., how_to_use_recodeflow_with_your_data.Rmd contains the same information as variable_details.Rmd and variables_sheet.Rmd)
- Make sure each vignette shows a particular workflow from start to finish (e.g., how to organize a variables sheet by walking through a sample, how to fill in a details sheet, how to call the main function to address common tasks)
- The DT package (used across several files) is in maintenance-only mode, so consider using another package for making tables
Questions that crossed my mind
- Can fields be left blank in the sheets?
- Are there data type checks that confirm that the data in the sheets are of the correct data type?
  - E.g., we wouldn't expect numeric data in the `label` column of the variables sheet
- Under what circumstances does the details sheet get updated after calling `rec_with_table()`?
how_to_install.Rmd
- Installation instructions should be in README.md instead of a vignette (i.e., this vignette could be removed)
- Be aware that anyone who installs directly from GitHub will need to explicitly request vignettes:

```r
# install.packages("devtools")
devtools::install_github("Big-Life-Lab/recodeflow", dependencies = TRUE, build_vignettes = TRUE)
```

- If we put our package on R-universe, users can install the development version simply using `install.packages()`
These suggestions sound good. With R-universe, how would users install the development version?
variables_sheet.Rmd
- State the source of the sample data
- State that users are expected to write their own variables sheet or use one that's been published as part of another package (e.g., `elderflow`)
- Provide the expected data types for each scenario
- Unclear how `variables` was created in the rendered documentation
  - The variable sheet is referred to as `variables.csv` but there's no indication that a CSV file was imported
- "Try sorting the subject column by clicking the up..."
  - The top 10 rows of the table don't show what's expected according to the vignette
  - You'd need to `filter(subject == "lab")` to see what's expected
- Would it be helpful to specify the expected data type of each column in the variables sheet (or implement some kind of input validation)?
- Can we clarify the difference between `section` and `subject`?
- Can the interactive table take up less space?
- The section "Derived Variables" looks incomplete
  - Complete it or at least define what a derived variable is and then point users to another vignette
  - You can link one vignette to another - https://r-pkgs.org/vignettes.html#links
variable_details.Rmd
- Explicitly state purpose of vignette (i.e., how to organize a details sheet)
- Unclear how `variable_details` was created in the rendered documentation
  - The sheet is referred to as `variable_details.csv` but there's no indication that a CSV file was imported
- "Each row in `variable_details` gives instructions to recode a single category of a final variable"
  - This statement doesn't apply to non-categorical variables
- Can we refrain from using the dollar sign "$" to refer to a variable in the dataset?
  - It's more likely that users will interact with these sheets via Excel (i.e., using base R syntax may be confusing for novice users of the package)
- Can the tables take up less space?
  - Lots of scrolling is required to read through the vignette
  - Can we also exclude the row ID values?
- As a novice user, I found it difficult to wrap my head around the information included in Table 1
  - It's helpful to know (1) which columns are optional and (2) that the order in which the columns appear doesn't matter, but are there other pieces of information necessary?
- `variableStart`: "If the variable name in a particular dataset is different from the recoded variable name..."
  - Does the original variable name need to be put in square brackets?
- `recStart`: consider adding more information on using the square brackets
  - Can we only use square brackets? Is a meaningful error thrown if a user tries to include round parentheses in their details sheet?
- "the function will not work if there different units between the rows of the same variable..."
  - Does the function throw a meaningful error?
- "In `variableStart`, instead of database names being listed, DerivedVar:: is written..."
  - I think it should be "variable names" instead of "database names"
- `DerivedVar::[var1, var2, var3]`: Does the order in which you put the variables in the square brackets matter? Is there a limit to the number of parameters you can pass?
- How does recodeflow know whether the derived variable function/reference table is available for recoding?
  - Does the R script containing the function/reference table definitions need to sit in a specific location on the machine?
  - Explanation provided here, but should be moved to a more obvious location: https://github.com/Big-Life-Lab/recodeflow/blob/04bcea058c7f6e99fa6ef263b16b2b079e8be56c/vignettes/derived_variables.Rmd#L54-L59
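To make the environment requirement concrete, here is a minimal sketch of a derived-variable function, assuming the cchsflow-style convention where `variableStart` contains something like `DerivedVar::[HWTGHTM, HWTGWTK]`. The function name, its arguments, and the lookup behaviour described in the comments are assumptions based on the linked explanation, not confirmed API:

```r
# Hypothetical derived-variable function. Per the linked explanation, the
# function must already be defined in the session (e.g., via source()) when
# rec_with_table() runs, because recodeflow resolves it by name at recode time.
bmi_fun <- function(HWTGHTM, HWTGWTK) {
  # Arguments appear to be matched positionally to the variables listed in
  # DerivedVar::[HWTGHTM, HWTGWTK], which would make the order matter.
  HWTGWTK / (HWTGHTM^2)
}
```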
Suppose the `databaseStart` column for a row is db1; db2; db3; db4 and the `variableStart` column is db1::var_1; db2::var_2; [var_3]. This is effectively saying: for db1 the start variable is var_1, for db2 the start variable is var_2, and for all other databases (db3, db4) use var_3. The `recStart` here would be (1,4].
missingdata(tagged_na).Rmd
- Avoid parentheses in file names
- "Base R supports only one type of NA..."
  - Consider rephrasing this since base R includes different types of NA (e.g., `NA_integer_`, `NA_real_`, `NA_character_`, `NA_complex_`)
- "Summary of `tagged_na` values and their corresponding category values."
  - Where did these category values come from? Is this some sort of domain knowledge?
- Is it necessary to include the code chunk when the examples provided are similar to those in the `haven` documentation?
  - Include a link instead - https://haven.tidyverse.org/reference/tagged_na.html
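For readers landing here, the `haven` behaviour under discussion can be shown in a few lines (this just restates the linked `haven` docs, nothing recodeflow-specific):

```r
library(haven)

x <- c(1, tagged_na("a"), tagged_na("b"), NA)
is.na(x)           # all three NA variants count as missing
na_tag(x)          # recovers the single-letter tags; NA for untagged values
print_tagged_na(x) # prints the tagged values distinctly, e.g. NA(a), NA(b)
```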
how_to_recode.Rmd
Example 1
- Why is it necessary to pass `var_labels = c(sex = "sex")`?
- We know `var_labels` is an optional argument, but what are the circumstances under which you want to use it?

Example 2
- Why is it necessary to pass `var_labels = c(sex = "Sex")` (and why is it capitalized this time round)?
- Why do we need to call `set_data_labels()` (and what does it mean for a label to be lost)?
- Why do we want labelled data?
Example 3
- Why do we call `bind_rows()` without calling `set_data_labels()`?

Example 4
- Same questions as example 2
Example 5
- The printouts take up a lot of space; it's hard to see where the second `rec_with_table()` call begins
- Again, why are we calling `set_data_labels()`?
?Example 6
- Why are we not using `set_data_labels()` in this example?
in this example?Example 7
- Derived variables are discussed across multiple vignettes; is it better for these pieces of information to be consolidated in a single file?
- Is `chol * bili` an actual measurement? Otherwise, could we provide a different example that deals with a measurement used in practice?
- What other common operations should be included in the vignette?
Labels can be lost when the data passes through a `dplyr` function. Hence this function.
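Re the label questions in Examples 2-6: a small base-R sketch of what "a label being lost" means in general (illustrative only; recodeflow's internals may differ). Label attributes live on a vector, and common operations that build new vectors, such as subsetting, silently drop them, which is why a function that re-applies labels from the sheets is needed:

```r
sex <- c(1, 2, 1)
attr(sex, "label") <- "Sex"
attr(sex, "label")    # "Sex"

kept <- sex[sex == 1] # `[` keeps only names/dim/dimnames, so the
attr(kept, "label")   # "label" attribute is dropped: NULL, the label is lost
```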
- For the package name convention, did you mean for the flow style packages?
- What convention were you thinking for file paths?
- Aren't `tester1` and `tester2` already being referred to as datasets? At least that's what I'm seeing in the how_to_recode vignette on master. In general, I believe we use dataset and database interchangeably within the context of recodeflow.
- For the redundancy example, did you mean the text at the bottom of the how_to_use_recodeflow vignette with the columns for the variables sheet?
- re blank values, we use the convention of encoding missing values with `NA` instead of blanks.
- re data type checks, we do not have any kind of validation for the data in the sheets but we should definitely have some.
- re updating the details sheet during a recode call, ideally never. The sheets should be read-only during a recode process.
- E.g., `dplyr` versus dplyr
- What happens when `variable` is filled in but `label`, `labelLong`, etc. are not provided for a given row?

how_to_install.Rmd
I think it's possible to get R-universe to track and build the development branch of a particular GitHub repo: https://ropensci.org/blog/2021/06/22/setup-runiverse/#pro-tip-tracking-custom-branches-or-releases
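For reference, the usual R-universe installation pattern looks like the following; the universe URL is an assumption, since the package isn't on R-universe yet:

```r
# Hypothetical: assumes a Big-Life-Lab universe exists at this URL
install.packages(
  "recodeflow",
  repos = c("https://big-life-lab.r-universe.dev", "https://cloud.r-project.org")
)
```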
variables_sheet.Rmd
- re provide the data types for each scenario, what do you mean by scenario? Do you mean columns?
- re where the variables sheet is coming from, it's coming from sourcing the test-data-generator.R file, which is done in this line. I'm not sure why this was done but we should move to CSV files.
- re input validation, definitely. Ideally we would have a function that validates the sheets.
- (1) Users can import their own variables sheet with `readr::read_csv()`, or (2) if someone wants to use a variables sheet that's part of a package, they should first install and then load that package. But ignore what I said earlier since the variables sheet is a `data.frame` in both scenarios.
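Re the validation wish above: a hypothetical sketch of what a sheet-validating helper could look like (`validate_variables_sheet`, the required column names, and the type rules are all invented for illustration; no such function exists in the package yet):

```r
validate_variables_sheet <- function(variables) {
  required <- c("variable", "label")  # assumed required columns
  missing_cols <- setdiff(required, names(variables))
  if (length(missing_cols) > 0) {
    stop("variables sheet is missing columns: ",
         paste(missing_cols, collapse = ", "))
  }
  # e.g., we wouldn't expect numeric data in the label column
  if (!is.character(variables$label)) {
    stop("`label` must be character, not ", class(variables$label)[1])
  }
  invisible(TRUE)
}
```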