reikookamoto opened this issue 1 year ago
The suggestions look great. I suggest addressing anything that looks uncontroversial (reducing redundancy, fixing grammar, clarifying...) and isn't a breaking change. You could PR the changes and get feedback on specifics.
Another way of replying is saying there are quite a few comments and suggestions and so it is difficult to respond to all of them in this thread. Let's have separate issue discussions for a few questions.
That said, here are a few specific responses.
- Adopt conventions when writing variable, function, and package names in addition to file paths
- Clarify terminology
  - What is the difference between a variable name and a label, and how does this relate to other programming languages like Stata?
  - `tester1` and `tester2` should be referred to as datasets rather than databases; the latter usually refers to an organized collection of data
- Clarify the order in which a new user should read the vignettes
- Proofread for grammatical and spelling errors
- Reduce redundancy in writing (i.e., how_to_use_recodeflow_with_your_data.Rmd contains the same information as variable_details.Rmd and variables_sheet.Rmd)
- Make sure each vignette shows a particular workflow from start to finish (e.g., how to organize a variables sheet by walking through a sample, how to fill in a details sheet, how to call the main function to address common tasks)
- The DT package (used across several files) is in maintenance-only mode, so consider using another package for making tables
Questions that crossed my mind
- Can fields be left blank in the sheets?
- Are there data type checks that confirm that the data in the sheets are of the correct data type?
  - E.g., we wouldn't expect numeric data in the `label` column of the variables sheet
- Under what circumstances does the details sheet get updated after calling `rec_with_table()`?
how_to_install.Rmd
- Installation instructions should be in README.md instead of a vignette (i.e., this vignette could be removed)
- Be aware that anyone who installs directly from GitHub will need to explicitly request vignettes:

```r
# install.packages("devtools")
devtools::install_github("Big-Life-Lab/recodeflow", dependencies = TRUE, build_vignettes = TRUE)
```

- If we put our package on R-universe, users can install the development version simply using `install.packages()`
These suggestions sound good. With R-universe, how would users install the development version?
variables_sheet.Rmd
- State the source of the sample data
- State that users are expected to write their own variables sheet or use one that's been published as part of another package (e.g., `elderflow`)
- Provide the expected data types for each scenario
- Unclear how `variables` was created in the rendered documentation
  - The variable sheet is referred to as `variables.csv` but there's no indication that a CSV file was imported
- "Try sorting the subject column by clicking the up..."
  - The top 10 rows of the table don't show what's expected according to the vignette
  - You'd need to `filter(subject == "lab")` to see what's expected
- Would it be helpful to specify the expected data type of each column in the variables sheet (or implement some kind of input validation)?
- Can we clarify the difference between `section` and `subject`?
- Can the interactive table take up less space?
- The section "Derived Variables" looks incomplete
  - Complete it or at least define what a derived variable is and then point users to another vignette
  - You can link one vignette to another - https://r-pkgs.org/vignettes.html#links
variable_details.Rmd
- Explicitly state purpose of vignette (i.e., how to organize a details sheet)
- Unclear how `variable_details` was created in the rendered documentation
  - The sheet is referred to as `variable_details.csv` but there's no indication that a CSV file was imported
- "Each row in `variable_details` gives instructions to recode a single category of a final variable"
  - This statement doesn't apply to non-categorical variables
- Can we refrain from using the dollar sign "$" to refer to a variable in the dataset?
  - It's more likely that users will interact with these sheets via Excel (i.e., using base R syntax may be confusing for novice users of the package)
- Can the tables take up less space?
  - Lots of scrolling is required to read through the vignette
  - Can we also exclude the row ID values?
- As a novice user, I found it difficult to wrap my head around the information included in Table 1
  - It's helpful to know (1) which columns are optional and (2) that the order in which the columns appear doesn't matter, but are there other pieces of information necessary?
- `variableStart`: "If the variable name in a particular dataset is different from the recoded variable name..."
  - Does the original variable name need to be put in square brackets?
- `recStart`: consider adding more information on using the square brackets
  - Can we only use square brackets? Is a meaningful error thrown if a user tries to include round parentheses in their details sheet?
- "the function will not work if there different units between the rows of the same variable..."
  - Does the function throw a meaningful error?
- "In `variableStart`, instead of database names being listed, DerivedVar:: is written..."
  - I think it should be "variable names" instead of "database names"
- `DerivedVar::[var1, var2, var3]`: Does the order in which you put the variables in the square brackets matter? Is there a limit to the number of parameters you can pass?
- How does recodeflow know whether the derived variable function/reference table is available for recoding?
  - Does the R script containing the function/reference table definitions need to sit in a specific location on the machine?
  - Explanation provided here, but should be moved to a more obvious location: https://github.com/Big-Life-Lab/recodeflow/blob/04bcea058c7f6e99fa6ef263b16b2b079e8be56c/vignettes/derived_variables.Rmd#L54-L59
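To make the environment requirement concrete, here is a minimal sketch of a derived-variable function, assuming the cchsflow-style convention where `variableStart` contains something like `DerivedVar::[HWTGHTM, HWTGWTK]`. The function name, its arguments, and the lookup behaviour described in the comments are assumptions based on the linked explanation, not confirmed API:

```r
# Hypothetical derived-variable function. Per the linked explanation, the
# function must already be defined in the session (e.g., via source()) when
# rec_with_table() runs, because recodeflow resolves it by name at recode time.
bmi_fun <- function(HWTGHTM, HWTGWTK) {
  # Arguments appear to be matched positionally to the variables listed in
  # DerivedVar::[HWTGHTM, HWTGWTK], which would make the order matter.
  HWTGWTK / (HWTGHTM^2)
}
```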
Suppose the `databaseStart` column for a row is db1; db2; db3; db4 and the `variableStart` column is db1::var_1; db2::var_2; [var_3]. This is effectively saying: for db1 the start variable is var_1, for db2 the start variable is var_2, and for all other databases (db3, db4) use var_3. The `recStart` here would be (1,4].
missingdata(tagged_na).Rmd
- Avoid parentheses in file names
- "Base R supports only one type of NA..."
  - Consider rephrasing this since base R includes different types of NA (e.g., `NA_integer_`, `NA_real_`, `NA_character_`, `NA_complex_`)
- "Summary of `tagged_na` values and their corresponding category values."
  - Where did these category values come from? Is this some sort of domain knowledge?
- Is it necessary to include the code chunk when the examples provided are similar to those in the `haven` documentation?
  - Include a link instead - https://haven.tidyverse.org/reference/tagged_na.html
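For readers landing here, the `haven` behaviour under discussion can be shown in a few lines (this just restates the linked `haven` docs, nothing recodeflow-specific):

```r
library(haven)

x <- c(1, tagged_na("a"), tagged_na("b"), NA)
is.na(x)           # all three NA variants count as missing
na_tag(x)          # recovers the single-letter tags; NA for untagged values
print_tagged_na(x) # prints the tagged values distinctly, e.g. NA(a), NA(b)
```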
how_to_recode.Rmd
Example 1
- Why is it necessary to pass `var_labels = c(sex = "sex")`?
- We know `var_labels` is an optional argument, but what are the circumstances under which you want to use it?

Example 2
- Why is it necessary to pass `var_labels = c(sex = "Sex")` (and why is it capitalized this time round)?
- Why do we need to call `set_data_labels()` (and what does it mean for a label to be lost)?
- Why do we want labelled data?
Example 3
- Why do we call `bind_rows()` without calling `set_data_labels()`?

Example 4
- Same questions as example 2
Example 5
- The printouts take up a lot of space; it's hard to see where the second `rec_with_table()` call begins
- Again, why are we calling `set_data_labels()`?
?Example 6
- Why are we not using `set_data_labels()` in this example?
in this example?Example 7
- Derived variables are discussed across multiple vignettes; is it better for these pieces of information to be consolidated in a single file?
- Is `chol * bili` an actual measurement? Otherwise, could we provide a different example that deals with a measurement used in practice?
- What other common operations should be included in the vignette?
Labels can be lost when the data passes through a `dplyr` function. Hence this function.
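Re the label questions in Examples 2-6: a small base-R sketch of what "a label being lost" means in general (illustrative only; recodeflow's internals may differ). Label attributes live on a vector, and common operations that build new vectors, such as subsetting, silently drop them, which is why a function that re-applies labels from the sheets is needed:

```r
sex <- c(1, 2, 1)
attr(sex, "label") <- "Sex"
attr(sex, "label")    # "Sex"

kept <- sex[sex == 1] # `[` keeps only names/dim/dimnames, so the
attr(kept, "label")   # "label" attribute is dropped: NULL, the label is lost
```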
- For the package name convention, did you mean for the flow style packages?
- What convention were you thinking for file paths?
- Aren't `tester1` and `tester2` already being referred to as datasets? At least that's what I'm seeing in the how_to_recode vignette on master. In general, I believe we use dataset and database interchangeably within the context of recodeflow.
- For the redundancy example, did you mean the text at the bottom of the how_to_use_recodeflow vignette with the columns for the variables sheet?
- re blank values, we use the convention of encoding missing values with `NA` instead of blanks.
- re data type checks, we do not have any kind of validation for the data in the sheets but we should definitely have some.
- re updating the details sheet during a recode call, ideally never. The sheets should be read-only during a recode process.
- E.g., `dplyr` versus dplyr
- What happens when `variable` is filled in but `label`, `labelLong`, etc. are not provided for a given row?

how_to_install.Rmd
I think it's possible to get R-universe to track and build the development branch of a particular GitHub repo: https://ropensci.org/blog/2021/06/22/setup-runiverse/#pro-tip-tracking-custom-branches-or-releases
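For reference, the usual R-universe installation pattern looks like the following; the universe URL is an assumption, since the package isn't on R-universe yet:

```r
# Hypothetical: assumes a Big-Life-Lab universe exists at this URL
install.packages(
  "recodeflow",
  repos = c("https://big-life-lab.r-universe.dev", "https://cloud.r-project.org")
)
```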
variables_sheet.Rmd
- re provide the data types for each scenario, what do you mean by scenario? Do you mean columns?
- re where the variables sheet is coming from, it's coming from sourcing the test-data-generator.R file, which is done in this line. I'm not sure why this was done but we should move to CSV files.
- re input validation, definitely. Ideally we would have a function that validates the sheets.
- (1) Users can import their own variables sheet with `readr::read_csv()`, or (2) if someone wants to use a variables sheet that's part of a package, they should first install and then load that package. But ignore what I said earlier since the variables sheet is a `data.frame` in both scenarios.
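Re the validation wish above: a hypothetical sketch of what a sheet-validating helper could look like (`validate_variables_sheet`, the required column names, and the type rules are all invented for illustration; no such function exists in the package yet):

```r
validate_variables_sheet <- function(variables) {
  required <- c("variable", "label")  # assumed required columns
  missing_cols <- setdiff(required, names(variables))
  if (length(missing_cols) > 0) {
    stop("variables sheet is missing columns: ",
         paste(missing_cols, collapse = ", "))
  }
  # e.g., we wouldn't expect numeric data in the label column
  if (!is.character(variables$label)) {
    stop("`label` must be character, not ", class(variables$label)[1])
  }
  invisible(TRUE)
}
```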