Big-Life-Lab / recodeflow

Harmonizing data into a common format.
https://big-life-lab.github.io/recodeflow/
Other
6 stars 1 forks source link

Suggestions for improving the vignettes #35

Open reikookamoto opened 1 year ago

reikookamoto commented 1 year ago

Questions that crossed my mind

reikookamoto commented 1 year ago

how_to_install.Rmd

# install.packages("devtools")
devtools::install_github("Big-Life-Lab/recodeflow", dependencies = TRUE, build_vignettes = TRUE)
reikookamoto commented 1 year ago

variables_sheet.Rmd

reikookamoto commented 1 year ago

variable_details.Rmd

reikookamoto commented 1 year ago

missingdata(tagged_na).Rmd

reikookamoto commented 12 months ago

rec_with_table.Rmd

reikookamoto commented 12 months ago

how_to_recode.Rmd

reikookamoto commented 12 months ago

how_to_use_recodeflow_with_your_data.Rmd

DougManuel commented 12 months ago

In general,

The suggestions look great. I suggest addressing anything that looks uncontroversial (reducing redundancy, fixing grammar, clarifying...) and isn't a breaking change. You could PR the changes and get feedback on specifics.

Another way of replying is saying there are quite a few comments and suggestions and so it is difficult to respond to all of them in this thread. Let's have separate issue discussions for a few questions.

That said, here is a few specific responses.

yulric commented 11 months ago
  • Adopt conventions when writing variable, function, and package names in addition to file paths
  • Clarify terminology

    • What is the difference between a variable name and a label, and how does this relate to other programming languages like Stata
    • tester1 and tester2 should be referred to as datasets rather than databases; the latter usually refers to an organized collection of data
  • Clarify the order in which a new user should read the vignettes
  • Proofread for grammatical and spelling errors
  • Reduce redundancy in writing (i.e., how_to_use_recodeflow_with_your_data.Rmd contains the same information as variable_details.Rmd and variables_sheet.Rmd)
  • Make sure each vignette shows a particular workflow from start to finish (e.g., how to organize a variables sheet by walking through a sample, how to fill in a details sheet, how to call the main function to address common tasks)
  • DT package (used across several files) is in maintenance-only mode so consider using another package for making tables

Questions that crossed my mind

  • Can fields be left blank in the sheets?
  • Are there data type checks that confirm that the data in the sheets are of the correct data type?

    • E.g., we wouldn't expect numeric data in the label column of the variables sheet
  • Under what circumstances does the details sheet get updated after calling rec_with_table()?
yulric commented 11 months ago

how_to_install.Rmd

  • Installation instructions should be in README.md instead of a vignette (i.e., this vignette could be removed)
  • Be aware that anyone who installs directly from GitHub will need to explicitly request vignettes:
# install.packages("devtools")
devtools::install_github("Big-Life-Lab/recodeflow", dependencies = TRUE, build_vignettes = TRUE)

These suggestion sound good. With runiverse, how would users install the development version?

yulric commented 11 months ago

variables_sheet.Rmd

  • State the source of the sample data
  • State that users are expected to write their own variables sheet or use one that's been published as part of another package (e.g., elderflow)

    • Provide the expected data types for each scenario
  • Unclear how variables was created in the rendered documentation

    • The variable sheet is referred to as variables.csv but there's no indication that a CSV file was imported
  • "Try sorting the subject column by clicking the up..."

    • The top 10 rows of the table doesn't show what's expected according to the vignette
      • You'd need to filter(subject == "lab") to see what's expected
  • Would it be helpful to specify the expected data type of each column in the variables sheet (or implement some kind of input validation)?
  • Can we clarify the difference between section and subject?
  • Can the interactive table take up less space?
  • The section "Derived Variables" looks incomplete

yulric commented 11 months ago

variable_details.Rmd

  • Explicitly state purpose of vignette (i.e., how to organize a details sheet)
  • Unclear how variable_details was created in the rendered documentation

    • The sheet is referred to as variable_details.csv but there's no indication that a CSV file was imported
  • _"Each row in variable_details gives instructions to recode a single category of a final variable"_

    • This statement doesn't apply to non-categorical variables
  • Can we refrain from using the dollar sign "$" to refer to a variable in the dataset?

    • It's more likely that users will interact with these sheets via Excel (i.e., using base R syntax may be confusing for novice users of the package)
  • Can the tables take up less space?

    • Lots of scrolling is required to read through the vignette
    • Can we also exclude the row ID values?
  • As a novice user, I found it difficult to wrap my head around the information included in Table 1

    • It's helpful to know (1) which columns are optional and (2) that the order in which the columns appear don't matter but are the other pieces of information necessary?
  • variableStart: "If the variable name in a particular dataset is different from the recoded variable name..."

    • Does the original variable name need to be put in square brackets?
  • recStart: consider adding more information on using the square brackets

    • Can we only use square brackets? Is a meaninful error thrown if a user tries to include round parentheses in their details sheet?
  • "the function will not work if there different units between the rows of the same variable..."

    • Does the function throw a meaningful error?
  • "In variableStart, instead of database names being listed, DerivedVar:: is written..."

    • I think it should be "variable names" instead of "database names"
  • DerivedVar::[var1, var2, var3]: Does the order in which you put the variables in the square brackets matter? Is there a limit to the number of parameters you can pass?
  • How does recodeflow know whether the derived variable function/reference table is available for recoding?

yulric commented 11 months ago

missingdata(tagged_na).Rmd

  • Avoid parentheses in file names
  • "Base R supports only one type of NA..."

    • Consider rephrasing this since base R includes different types of NA (e.g., NA_integer_, NA_real_, NA_character_, NA_complex_)
  • _"Summary of tagged_na values and their corresponding category values."_

    • Where did these category values come from? Is this some sort of domain knowledge?
  • Is it necessary to include the code chunk when the examples provided are similar to the similar to those in the haven documentation

yulric commented 11 months ago

how_to_recode.Rmd

  • Example 1

    • Why is it necessary to pass var_labels = c(sex = "sex")

    • We know var_labels is an optional argument, but what are the circumstances under which you want to use it?

  • Example 2

    • Why is it necessary to pass var_labels = c(sex = "Sex") (and why is it capitalized this time round)
    • Why do we need to call set_data_labels() (and what does it mean for label to be lost)?
    • Why do we want labelled data?
  • Example 3

    • Why do we call bind_rows() without calling set_data_labels()?
  • Example 4

    • Same questions as example 2
  • Example 5

    • The print outs take up a lot of space; it's hard to see where the second rec_with_table() call begins
    • Again, why are calling set_data_labels()?
  • Example 6

    • Why are not using set_data_labels() in this example?
  • Example 7

    • Derived variables are discussed across multiple vignettes; is it better for these pieces of information to be consolidated in a single file?
    • Is chol * bili an actual measurement? Otherwise, could we provide a different example that deals with a measurement used in practice?
  • What other common operations should be included in the vignette?
reikookamoto commented 11 months ago
  • Adopt conventions when writing variable, function, and package names in addition to file paths
  • Clarify terminology

    • What is the difference between a variable name and a label, and how does this relate to other programming languages like Stata
    • tester1 and tester2 should be referred to as datasets rather than databases; the latter usually refers to an organized collection of data
  • Clarify the order in which a new user should read the vignettes
  • Proofread for grammatical and spelling errors
  • Reduce redundancy in writing (i.e., how_to_use_recodeflow_with_your_data.Rmd contains the same information as variable_details.Rmd and variables_sheet.Rmd)
  • Make sure each vignette shows a particular workflow from start to finish (e.g., how to organize a variables sheet by walking through a sample, how to fill in a details sheet, how to call the main function to address common tasks)
  • DT package (used across several files) is in maintenance-only mode so consider using another package for making tables

Questions that crossed my mind

  • Can fields be left blank in the sheets?
  • Are there data type checks that confirm that the data in the sheets are of the correct data type?

    • E.g., we wouldn't expect numeric data in the label column of the variables sheet
  • Under what circumstances does the details sheet get updated after calling rec_with_table()?
  • For the package name convention, did you mean for the flow style packages?
  • What convention were you thinking for file paths?
  • Aren't tester1 and tester2 already being referred to as datasets? At least that's what I'm seeing in the how_to_recode vignette on master. In general, I believe we use dataset and database interchangeably within the context recodeflow.
  • For the redundancy example, did you mean the text at the bottom of the how_to_use_recodeflow vignette with the columns for the variables sheet?
  • re blank values, we use the convention of encoding missing values with NA instead of blanks.
  • re data type checks, we do not have any kind of validation for the data in the sheets but we should definitely have some.
  • re updating the details sheet during a recode call, ideally never. The sheets should be read-only during a recode process.
reikookamoto commented 11 months ago

how_to_install.Rmd

  • Installation instructions should be in README.md instead of a vignette (i.e., this vignette could be removed)
  • Be aware that anyone who installs directly from GitHub will need to explicitly request vignettes:
# install.packages("devtools")
devtools::install_github("Big-Life-Lab/recodeflow", dependencies = TRUE, build_vignettes = TRUE)

These suggestion sound good. With runiverse, how would users install the development version?

I think it's possible to get R-universe to track and build the development branch of a particular GitHub repo: https://ropensci.org/blog/2021/06/22/setup-runiverse/#pro-tip-tracking-custom-branches-or-releases

reikookamoto commented 11 months ago

variables_sheet.Rmd

  • State the source of the sample data
  • State that users are expected to write their own variables sheet or use one that's been published as part of another package (e.g., elderflow)

    • Provide the expected data types for each scenario
  • Unclear how variables was created in the rendered documentation

    • The variable sheet is referred to as variables.csv but there's no indication that a CSV file was imported
  • "Try sorting the subject column by clicking the up..."

    • The top 10 rows of the table doesn't show what's expected according to the vignette
      • You'd need to filter(subject == "lab") to see what's expected
  • Would it be helpful to specify the expected data type of each column in the variables sheet (or implement some kind of input validation)?
  • Can we clarify the difference between section and subject?
  • Can the interactive table take up less space?
  • The section "Derived Variables" looks incomplete

  • re provide the data types for each scenario, what do you mean by scenario. Do you mean columns?
  • re where the variables sheet is coming from, its coming from sourcing the test-data-generator.R file which is done in this line. I'm not sure why this was done but we should move to CSV files.
  • re input validation, definitely. Ideally we would have a function that validates the sheets.