Closed Leah-Coffin closed 1 month ago
What does your code look like for that step in your functions-demo.rmd vs salary-report?
Has the first name been added back to the data frame successfully before trying to merge?
To debug the salary report you can always load the data there directly and walk through the document one chunk at a time to find the problem:
# FOR MANUAL TESTING ONLY
# URL <- "https://raw.githubusercontent.com/Watts-College/paf-514-template/main/labs/batch-demo/asu-salaries-2020.csv"
# d <- read.csv( URL )
This is part 1, step 3 in my functions demo, where I coded gender and added it to the data frame (I am able to see both gender and first.name in the data frame):
#part 1 step 3
add_gender <- function(d) {
# Build gender table for the names
gender_table <- gender(unique.first.names)
# Merge gender information with the original data
merged_data <- merge.data.frame(d, gender_table, by.x = "first.name", by.y = "name", all.x = TRUE)
# Assign "uncoded" to names with missing genders
merged_data$gender[ is.na(merged_data$gender) ] <- "uncoded"
merged_data$gender <- factor( merged_data$gender, levels=c("male","female","uncoded") )
return(merged_data)
}
d <- add_gender(d)
When I run manual testing in the salary-report, I get the following error: Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
(1) I'm assuming you checked that d has class() data.frame and includes the column "first.name" before passing it to the function? Sometimes the objects get mangled in prior steps - you did not provide a reproducible example so I can't inspect your workflow. Those are easy things to eliminate from the list.
(2) Recall that functions are input-output machines. What information does the function need? You are currently passing it only d, but I see that you then reference unique.first.names, which has not been defined anywhere inside the function. You should pass it explicitly if you are not creating it inside the function:
# option A: create it inside the function from d
unique.first.names <- unique( d$first.name )
# option B: pass it to the function as a second argument
add_gender <- function( d, unique.first.names )
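A minimal sketch of option A, combined with the input checks from item (1). Note the assumptions: `lookup_gender()` and its three-name table are hypothetical stand-ins for the real `gender()` call so that the example runs without the gender package, and the `demo` data frame is made up for illustration.

```r
# Hypothetical stand-in for gender(): returns a data frame with
# columns "name" and "gender" for the names it recognizes.
lookup_gender <- function( first.names ) {
  known <- data.frame(
    name   = c( "Kelsea", "James", "Jose" ),
    gender = c( "female", "male", "male" ),
    stringsAsFactors = FALSE
  )
  known[ known$name %in% first.names, ]
}

# Everything the function needs arrives through its arguments
# or is created inside the function body.
add_gender <- function( d ) {
  # stop early if the input is not what the function expects
  stopifnot( is.data.frame( d ), "first.name" %in% names( d ) )

  unique.first.names <- unique( d$first.name )   # created inside, from d
  gender_table <- lookup_gender( unique.first.names )

  merged_data <- merge( d, gender_table,
                        by.x = "first.name", by.y = "name", all.x = TRUE )

  # names without a match come back as NA after the merge
  merged_data$gender[ is.na( merged_data$gender ) ] <- "uncoded"
  merged_data$gender <- factor( merged_data$gender,
                                levels = c( "male", "female", "uncoded" ) )
  merged_data
}

# made-up demo data
demo <- data.frame(
  first.name = c( "Kelsea", "Enyah", "James" ),
  FTE        = c( 100L, 100L, 100L ),
  stringsAsFactors = FALSE
)
demo <- add_gender( demo )
```

With the check in place, passing a data frame that lacks first.name stops on the first line of the function instead of failing later inside the merge step.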
The function you would normally call is merge(), not merge.data.frame().
I am not sure where you found that function. It's the S3 method of merge() for data frames, which is fairly advanced functionality in R that is usually only documented in the technical R programming manual. You don't see it used often. Not important, but I am curious where you came across it.
I suspect that is your issue, although I am not entirely clear on how functions behave when you reference their S3 methods directly. Typically you just call generic functions like print(), summary(), or merge(), and R calls in the background the specific S3 method that was written for that object type: print.data.frame(), summary.data.frame(), or merge.data.frame().
https://stat.ethz.ch/R-manual/R-patched/library/base/html/merge.html
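A quick way to test that suspicion, sketched on toy data (the data frames `x` and `g` are made up for illustration). It compares the generic with the method called directly, then tries the same merge after dropping the key column and captures the error text. If the two merges match and the captured message is the one reported above, a missing first.name column in d, as in item (1), would be the first thing to rule out.

```r
x <- data.frame( first.name = c( "Kelsea", "James" ), FTE = c( 100L, 50L ) )
g <- data.frame( name = c( "Kelsea", "James" ), gender = c( "female", "male" ) )

# generic vs. method called directly
via.generic <- merge( x, g, by.x = "first.name", by.y = "name", all.x = TRUE )
via.method  <- merge.data.frame( x, g, by.x = "first.name", by.y = "name", all.x = TRUE )
same.result <- identical( via.generic, via.method )

# drop the key column, then try the same merge and capture the error text
x.no.key <- x[ , "FTE", drop = FALSE ]
msg <- tryCatch(
  { merge( x.no.key, g, by.x = "first.name", by.y = "name", all.x = TRUE ); "no error" },
  error = function( e ) conditionMessage( e )
)

print( same.result )
print( msg )
```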
Side note - GitHub speaks markdown, so you format code here the same way you do in R Markdown documents - with "fences":
# some code
x <- 1:10
y <- 2*x + rnorm(10)
plot( x, y )
If you reference the language directly it will add syntax highlighting specific to the language:
```r
# some code
x <- 1:10
y <- 2*x + rnorm(10)
plot( x, y )
```
The only difference is that R Markdown adds the squiggly brackets after the opening fence so you can add chunk arguments, for example {r} or {r, results="asis"}.
To make this example reproducible you would need to include a demo data frame:
# IN YOUR CODE:
# read data
# wrangling up to current step
# make the example reproducible
# by including the data as it appears
# AT THIS POINT in your code:
head(d) %>% dput() # paste output into the question:
d <-
structure(list(Calendar.Year = c(2020L, 2020L, 2020L, 2020L,
2020L, 2020L), Full.Name = c("ABBASI, Mohammad", "ARQUIZA, Jose Maria Reynaldo Apollo",
"Aaberg, Kelsea", "Abadjivor, Enyah", "Abayesu, Precious", "Abbas, James"
), Job.Description = c("Research/Lab Assistant", "Lecturer",
"Student Support Specialist", "Project Manager", "Management Intern",
"Assoc Professor"), Department.Description = c("Sch Biological & Hlth Sys Engr",
"Sch Biological & Hlth Sys Engr", "Admission Services", "CASGE Tempe",
"Health & Clinical Partnerships", "Sch Biological & Hlth Sys Engr"
), Salary = c("$35,090.00", "$71,400.00", "$36,000.00", "$64,000.00",
"$20,800.00", "$107,195.00"), FTE = c(100L, 100L, 100L, 100L,
50L, 100L), first.name = c("Mohammad", "Jose", "Kelsea", "Enyah",
"Precious", "James")), row.names = c(NA, 6L), class = "data.frame")
add_gender <- function(d) {
# Build gender table for the names
gender_table <- gender(unique.first.names)
# Merge gender information with the original data
merged_data <- merge.data.frame(d, gender_table, by.x = "first.name", by.y = "name", all.x = TRUE)
# Assign "uncoded" to names with missing genders
merged_data$gender[ is.na(merged_data$gender) ] <- "uncoded"
merged_data$gender <- factor( merged_data$gender, levels=c("male","female","uncoded") )
return(merged_data)
}
d <- add_gender(d)
The dput() function prints code that can be read directly to recreate the R object. Just make sure to include a sample of the data frame, not the entire thing (head() returns the first six rows).
Your colleagues can now help you identify the bug because they will have all of the information that is producing the error at that point in your code.
I'm emphasizing this point here because it's a useful skill for collaborating in teams, and also a professional norm for asking questions on data science discussion boards like Stack Overflow, R Open Science, Posit, Tidyverse package help pages, etc.
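If you want to convince yourself that the dput() text really does rebuild the object, here is a small round-trip sketch. The names `d.small`, `code.text`, and `d.rebuilt` are made up for illustration, and the data frame uses only character and integer columns to keep the comparison simple.

```r
d.small <- data.frame(
  first.name = c( "Mohammad", "Jose", "Kelsea" ),
  FTE        = c( 100L, 100L, 50L ),
  stringsAsFactors = FALSE
)

# capture the text that dput() would print to the console
code.text <- capture.output( dput( d.small ) )

# evaluating that text rebuilds the object
d.rebuilt <- eval( parse( text = code.text ) )

# compare the rebuilt copy with the original
all.equal( d.small, d.rebuilt )
```

Pasting the dput() output into a question does the same thing by hand: whoever reads it runs the text and ends up with the same data frame you had.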
# dput() prints a version of the object that can
# be read back into R directly to reconstruct
# the original:
d <-
structure(list(Calendar.Year = c(2020L, 2020L, 2020L, 2020L,
2020L, 2020L), Full.Name = c("ABBASI, Mohammad", "ARQUIZA, Jose Maria Reynaldo Apollo",
"Aaberg, Kelsea", "Abadjivor, Enyah", "Abayesu, Precious", "Abbas, James"
), Job.Description = c("Research/Lab Assistant", "Lecturer",
"Student Support Specialist", "Project Manager", "Management Intern",
"Assoc Professor"), Department.Description = c("Sch Biological & Hlth Sys Engr",
"Sch Biological & Hlth Sys Engr", "Admission Services", "CASGE Tempe",
"Health & Clinical Partnerships", "Sch Biological & Hlth Sys Engr"
), Salary = c("$35,090.00", "$71,400.00", "$36,000.00", "$64,000.00",
"$20,800.00", "$107,195.00"), FTE = c(100L, 100L, 100L, 100L,
50L, 100L), first.name = c("Mohammad", "Jose", "Kelsea", "Enyah",
"Precious", "James")), row.names = c(NA, 6L), class = "data.frame")
pander::pander( d )
---------------------------------------------------------------------------
Calendar.Year Full.Name Job.Description
--------------- ------------------------------ ----------------------------
2020 ABBASI, Mohammad Research/Lab Assistant
2020 ARQUIZA, Jose Maria Reynaldo Lecturer
Apollo
2020 Aaberg, Kelsea Student Support Specialist
2020 Abadjivor, Enyah Project Manager
2020 Abayesu, Precious Management Intern
2020 Abbas, James Assoc Professor
---------------------------------------------------------------------------
Table: Table continues below
-----------------------------------------------------------------
Department.Description Salary FTE first.name
-------------------------------- ------------- ----- ------------
Sch Biological & Hlth Sys Engr $35,090.00 100 Mohammad
Sch Biological & Hlth Sys Engr $71,400.00 100 Jose
Admission Services $36,000.00 100 Kelsea
CASGE Tempe $64,000.00 100 Enyah
Health & Clinical Partnerships $20,800.00 50 Precious
Sch Biological & Hlth Sys Engr $107,195.00 100 James
-----------------------------------------------------------------
@lecy
When I run a batch for 2020 data, I get the error below. There were no errors in the functions-demo when I created and tested the add_gender.
This is the line from utils.R:
It works fine in my functions-demo.rmd.