Watts-College / paf-514-template

https://watts-college.github.io/paf-514-template/

Part 3 Step 3 Creating a Batch #60

Closed Leah-Coffin closed 1 month ago

Leah-Coffin commented 7 months ago

@lecy

When I run a batch for the 2020 data, I get the error below. There were no errors in functions-demo.rmd when I created and tested add_gender().

processing file: salary-report.rmd
  |.......................................................................                                    |  67% [data]           
Quitting from lines 45-65 [data] (salary-report.rmd)
Error in `fix.by()`:
! 'by' must specify a uniquely valid column
Backtrace:
 1. global add_gender(d)
 2. base::merge.data.frame(...)
      at utils.R:21:2
 3. base (local) fix.by(by.x, x)

This is the line from utils.R:

  merged_data <- merge.data.frame(d, gender_table, by.x = "first.name", by.y = "name", all.x = TRUE)

It works fine in my functions-demo.rmd.

lecy commented 7 months ago

What does your code look like for that step in your functions-demo.rmd vs salary-report?

Has the first name been added back to the data frame successfully before trying to merge?

To debug the salary report you can always load the data there directly and walk through the document one chunk at a time to find the problem:

# FOR MANUAL TESTING ONLY
# URL <- "https://raw.githubusercontent.com/Watts-College/paf-514-template/main/labs/batch-demo/asu-salaries-2020.csv"
# d <- read.csv( URL )

Leah-Coffin commented 7 months ago

This is part 1, step 3 in my functions demo, where I create the gender variable and add it to the data frame (I can see both gender and first.name in the data frame):

#part 1 step 3

add_gender <- function(d) {

  # Build gender table for the names
  gender_table <- gender(unique.first.names)

  # Merge gender information with the original data
  merged_data <- merge.data.frame(d, gender_table, by.x = "first.name", by.y = "name", all.x = TRUE)

  # Assign "uncoded" to names with missing genders
  merged_data$gender[ is.na(merged_data$gender) ] <- "uncoded"
  merged_data$gender <- factor( merged_data$gender, levels=c("male","female","uncoded") )

  return(merged_data)
}
d <- add_gender(d)

Leah-Coffin commented 7 months ago

When I run manual testing in the salary-report, I get the following error: Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column

lecy commented 7 months ago

(1) I'm assuming you checked that d has class() data.frame and includes the column "first.name" before passing it to the function? Sometimes the objects get mangled in prior steps - you did not provide a reproducible example so I can't inspect your workflow. Those are easy things to eliminate from the list.

(2) Recall that functions are input-output machines. What information does the function need? You are currently passing it only d, but you then reference unique.first.names, which is not defined anywhere inside the function. You should pass it explicitly if you are not creating it inside the function:

# option 1: create it inside the function from d
unique.first.names <- unique( d$first.name )

# option 2: pass it to the function as a second argument
# (function body omitted here)
add_gender <- function( d, unique.first.names )
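
Putting points (1) and (2) together, a minimal sketch of a self-contained version might look like the one below. It is not the lab solution. It assumes the lookup table is passed in as an argument, and it uses a small hand-made gender_table (hypothetical values) in place of the output of gender() so that it runs without the package. The input checks are illustrative.

```r
# Sketch only: gender_table is a hand-made stand-in for the
# table that gender( unique.first.names ) would return.
add_gender <- function( d, gender_table ) {

  # fail early with a clear message if the inputs are not as expected
  stopifnot( is.data.frame( d ), "first.name" %in% names( d ) )
  stopifnot( all( c( "name", "gender" ) %in% names( gender_table ) ) )

  # left join: keep every row of d, add gender where the name matches
  merged_data <- merge( d, gender_table[ c( "name", "gender" ) ],
                        by.x = "first.name", by.y = "name", all.x = TRUE )

  # names with no match get the "uncoded" level
  merged_data$gender[ is.na( merged_data$gender ) ] <- "uncoded"
  merged_data$gender <- factor( merged_data$gender,
                                levels = c( "male", "female", "uncoded" ) )

  return( merged_data )
}

# toy inputs (hypothetical values, for illustration only)
d <- data.frame( first.name = c( "James", "Kelsea", "Enyah" ),
                 FTE = c( 100L, 100L, 50L ),
                 stringsAsFactors = FALSE )
gender_table <- data.frame( name = c( "James", "Kelsea" ),
                            gender = c( "male", "female" ),
                            stringsAsFactors = FALSE )

d2 <- add_gender( d, gender_table )
```

Because everything the function needs arrives through its arguments, it should behave the same way in functions-demo.rmd and in salary-report.rmd, and the stopifnot() checks point at a missing first.name column before merge() gets a chance to fail.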

The base R function you would normally call is merge(), not merge.data.frame().

I am not sure where you found that function? It's the S3 method of merge for data frames, which is fairly advanced functionality in R that is usually only documented in the technical R programming manual. You don't see it used often. Not important, but I am curious where you came across it?

I suspect that is your issue, although I am not entirely clear on how functions behave when you call their S3 methods directly. Typically you just call generic functions like print(), summary(), or merge(), and R dispatches in the background to the S3 method that was written for that object type: print.data.frame(), summary.data.frame(), or merge.data.frame().

https://stat.ethz.ch/R-manual/R-patched/library/base/html/merge.html
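
One way to test that suspicion is a small sketch like the one below, using made-up toy data. It compares the generic with the method called directly, then tries the same merge on a data frame that lacks the key column. As far as I can tell the two calls should match for data frames, and the 'by' error is the one merge raises when the column named in by.x cannot be found, which would point back at first.name rather than at merge.data.frame().

```r
# toy data (made up for illustration)
d <- data.frame( first.name = c( "James", "Kelsea" ),
                 FTE = c( 100L, 50L ),
                 stringsAsFactors = FALSE )
gender_table <- data.frame( name = c( "James", "Kelsea" ),
                            gender = c( "male", "female" ),
                            stringsAsFactors = FALSE )

# the generic and the S3 method called directly
m1 <- merge( d, gender_table,
             by.x = "first.name", by.y = "name", all.x = TRUE )
m2 <- merge.data.frame( d, gender_table,
                        by.x = "first.name", by.y = "name", all.x = TRUE )
same.result <- identical( m1, m2 )

# drop the key column to mimic a data frame where
# first.name was never created, and capture the error message
d.broken <- d[ "FTE" ]
err <- tryCatch(
  merge( d.broken, gender_table,
         by.x = "first.name", by.y = "name", all.x = TRUE ),
  error = function( e ) conditionMessage( e )
)
```

If same.result is TRUE and err holds the "uniquely valid column" message, the thing to check in salary-report.rmd is whether names(d) contains first.name at the point where add_gender(d) is called.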

lecy commented 7 months ago

Side note - GitHub speaks markdown, so you format code here the same way you do in R Markdown documents - with "fences":

```
# some code
x <- 1:10
y <- 2*x + rnorm(10)
plot( x, y )
```

If you reference the language directly it will add syntax highlighting specific to the language:

```r
# some code
x <- 1:10
y <- 2*x + rnorm(10)
plot( x, y )
```

The only difference is that R Markdown adds curly braces so you can add chunk arguments:

```{r}
# chunk with default options
```

```{r, results="asis"}
# chunk output is inserted into the document as raw markdown
```

lecy commented 7 months ago

Reproducible Examples

To make this example reproducible you would need to include a demo data frame:

# IN YOUR CODE: 
# read data
# wrangling up to current step

# make the example reproducible 
# by including the data as it appears
# AT THIS POINT in your code: 

dput( head(d) )  # paste the output into your question:

d <- 
structure(list(Calendar.Year = c(2020L, 2020L, 2020L, 2020L, 
2020L, 2020L), Full.Name = c("ABBASI, Mohammad", "ARQUIZA, Jose Maria Reynaldo Apollo", 
"Aaberg, Kelsea", "Abadjivor, Enyah", "Abayesu, Precious", "Abbas, James"
), Job.Description = c("Research/Lab Assistant", "Lecturer", 
"Student Support Specialist", "Project Manager", "Management Intern", 
"Assoc Professor"), Department.Description = c("Sch Biological & Hlth Sys Engr", 
"Sch Biological & Hlth Sys Engr", "Admission Services", "CASGE  Tempe", 
"Health & Clinical Partnerships", "Sch Biological & Hlth Sys Engr"
), Salary = c("$35,090.00", "$71,400.00", "$36,000.00", "$64,000.00", 
"$20,800.00", "$107,195.00"), FTE = c(100L, 100L, 100L, 100L, 
50L, 100L), first.name = c("Mohammad", "Jose", "Kelsea", "Enyah", 
"Precious", "James")), row.names = c(NA, 6L), class = "data.frame")

add_gender <- function(d) {

  # Build gender table for the names
  gender_table <- gender(unique.first.names)

  # Merge gender information with the original data
  merged_data <- merge.data.frame(d, gender_table, by.x = "first.name", by.y = "name", all.x = TRUE)

  # Assign "uncoded" to names with missing genders
  merged_data$gender[ is.na(merged_data$gender) ] <- "uncoded"
  merged_data$gender <- factor( merged_data$gender, levels=c("male","female","uncoded") )

  return(merged_data)
}
d <- add_gender(d)

The dput() function prints code that can be run directly to recreate the R object. Just make sure to include a sample of the data frame, not the entire thing (head() returns the first six rows by default).
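
To see why this works, here is a minimal round-trip sketch. The two-row data frame is made up and stands in for head(d), and capture.output() is used only to simulate copying the printed text from the console.

```r
# small made-up data frame standing in for head( d )
d <- data.frame( first.name = c( "James", "Kelsea" ),
                 FTE = c( 100L, 50L ),
                 stringsAsFactors = FALSE )

# capture the text that dput() would print to the console
code.text <- capture.output( dput( d ) )

# a colleague pastes that text into their session and runs it
d.copy <- eval( parse( text = code.text ) )

# check that the reconstructed object matches the original
round.trip.ok <- identical( d, d.copy )
```

The point is that the pasted text carries the column types and values along with it, so whoever answers the question starts from the same object you had.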

Your colleagues can now help you identify the bug because they will have all of the information that is producing the error at that point in your code.

I'm emphasizing this point here because it's a useful skill for collaborating in teams, and also a professional norm for asking questions on data science discussion boards like Stack Overflow, rOpenSci, Posit, Tidyverse package help pages, etc.

# dput() prints a version of the object that can
# be read back into R directly to reconstruct
# the original: 

d <- 
structure(list(Calendar.Year = c(2020L, 2020L, 2020L, 2020L, 
2020L, 2020L), Full.Name = c("ABBASI, Mohammad", "ARQUIZA, Jose Maria Reynaldo Apollo", 
"Aaberg, Kelsea", "Abadjivor, Enyah", "Abayesu, Precious", "Abbas, James"
), Job.Description = c("Research/Lab Assistant", "Lecturer", 
"Student Support Specialist", "Project Manager", "Management Intern", 
"Assoc Professor"), Department.Description = c("Sch Biological & Hlth Sys Engr", 
"Sch Biological & Hlth Sys Engr", "Admission Services", "CASGE  Tempe", 
"Health & Clinical Partnerships", "Sch Biological & Hlth Sys Engr"
), Salary = c("$35,090.00", "$71,400.00", "$36,000.00", "$64,000.00", 
"$20,800.00", "$107,195.00"), FTE = c(100L, 100L, 100L, 100L, 
50L, 100L), first.name = c("Mohammad", "Jose", "Kelsea", "Enyah", 
"Precious", "James")), row.names = c(NA, 6L), class = "data.frame")

pander::pander( d )
---------------------------------------------------------------------------
 Calendar.Year            Full.Name                  Job.Description       
--------------- ------------------------------ ----------------------------
     2020              ABBASI, Mohammad           Research/Lab Assistant   

     2020        ARQUIZA, Jose Maria Reynaldo            Lecturer          
                            Apollo                                         

     2020               Aaberg, Kelsea          Student Support Specialist 

     2020              Abadjivor, Enyah              Project Manager       

     2020             Abayesu, Precious             Management Intern      

     2020                Abbas, James                Assoc Professor       
---------------------------------------------------------------------------

Table: Table continues below

-----------------------------------------------------------------
     Department.Description         Salary      FTE   first.name 
-------------------------------- ------------- ----- ------------
 Sch Biological & Hlth Sys Engr   $35,090.00    100    Mohammad  

 Sch Biological & Hlth Sys Engr   $71,400.00    100      Jose    

       Admission Services         $36,000.00    100     Kelsea   

          CASGE Tempe             $64,000.00    100     Enyah    

 Health & Clinical Partnerships   $20,800.00    50     Precious  

 Sch Biological & Hlth Sys Engr   $107,195.00   100     James    
-----------------------------------------------------------------