Watts-College / cpp-527-fall-2021

A course shell for CPP 527 Foundations of Data Science II
https://watts-college.github.io/cpp-527-fall-2021/

Final project step 10 question #75

Open Johaning opened 2 years ago

Johaning commented 2 years ago

For step 10, should we show the 5 faculty with the highest actual salaries, or the highest salaries when adjusted to full-time equivalent (FTE)?

lecy commented 2 years ago

It doesn't matter to me.

Although after you mention it, it would be good to footnote that salaries are FTE adjusted.
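For anyone unsure what the FTE adjustment looks like in practice, here is a minimal sketch. It assumes FTE is stored as a percentage (100 = full time) and that salary is already numeric; both are assumptions about the data, and the toy names and values are made up.

```r
# toy data for illustration only; names and values are hypothetical
d <- data.frame(
  Full.Name = c( "A", "B", "C" ),
  salary    = c( 50000, 90000, 40000 ),   # actual pay, numeric
  FTE       = c( 50, 100, 25 )            # assumed: percent of full time
)

# scale actual pay up to its full-time equivalent
d$salary.fte <- d$salary / ( d$FTE / 100 )

# rankings can differ depending on which version you sort by
d[ order( -d$salary ), ]
d[ order( -d$salary.fte ), ]
```

If FTE is coded as a fraction (1 = full time) in your data, drop the division by 100.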

If this were a real report there would be a whole section on methodology that explains all of the steps.

That is essentially the role that your function-demo.rmd & function-demo.html documents are performing in this assignment - show how you arrive at the results that you present in the report if someone wants to audit the process and inspect all of the assumptions that are embedded in the analysis.

ctmccull commented 2 years ago

@lecy I'm struggling to think through how to go about step 10. I have some ideas for selecting the top 5 salaries, but I'm not sure how to make sure it displays the other variables in the table. Here are my starting ideas.

top_salary_table <- function(d, Salary)
{topsalary <-
  d %>%
 arrange(d, desc(Salary)) %>%
  slice(1:5)

dollarize(topsalary)

return(topsalary)
}

Alternatively...

top_salaries <- function(d, Salary)
  {topsalary <-
  d%>%
  filter(d, row_number(desc(Salary)) <= 5)
dollarize(topsalary)
return(topsalary)
}

Are these close? My next guess is to add group_by() to make sure name, gender, and position are where they need to be.

lecy commented 2 years ago

First things first.

For dplyr functions the first argument is always the data frame:

filter( d, salary > 100000 )

If you are using pipes you don't include the data frame in the function - it is passed as the first argument by the pipe:

d %>% filter( salary > 100000 )

You are doing both:

d %>%
  arrange( d, desc(Salary) )

It will run but might not do what you expect.
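
A minimal sketch of the two equivalent call styles, using a small made-up data frame (the names and values are hypothetical) and assuming dplyr is installed:

```r
library( dplyr )

# hypothetical toy data
d <- data.frame(
  Full.Name = c( "A", "B", "C" ),
  salary    = c( 120000, 80000, 150000 )
)

# data frame passed explicitly as the first argument
d1 <- filter( d, salary > 100000 )

# data frame passed by the pipe, so it is left out of the call
d2 <- d %>% filter( salary > 100000 )

identical( d1, d2 )   # both styles return the same rows
```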

Second, which ingredients do you need for this recipe?

The Salary variable lives inside of the data frame d so you don't need to pass both.

I would also include a select() step to pick a small number of variables to report in this table. I think there were four in the example.

# top_salary_table <- function(d, Salary)
top_salary_table <- function( d )
{
  topsalary <-
    d %>%
    select( Full.Name, Salary, etc ) %>% 
    arrange( desc(Salary) ) %>%
    slice(1:5)

  return( topsalary )
}

Regarding table sorting, you are committing one important error here. Note that Salary is the raw text version of income and salary is the numeric version.

Will Salary sort correctly? What's happening here?

x <-
c("$35,090.00", "$71,400.00", "$36,000.00", "$64,000.00", "$20,800.00", 
"$107,195.00", "$147,225.00", "$98,426.00", "$89,406.00", "$95,095.00" )

> sort( x, decreasing=TRUE )
 [1] "$98,426.00"  "$95,095.00"  "$89,406.00"  "$71,400.00"  "$64,000.00" 
 [6] "$36,000.00"  "$35,090.00"  "$20,800.00"  "$147,225.00" "$107,195.00"
> 

Recall the difference between sorting numbers (arrange by magnitude) and alphabetization.

Make sure you are using the numeric version of salary. Now the dollarize formatting will make more sense. But note that you can only apply it after sorting (arranging) so that you are not alphabetizing salaries instead of sorting them.
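
One way to get a numeric version, sketched here in base R under the assumption that the raw strings contain only a dollar sign, commas, digits, and a decimal point (the name x.num is just illustrative):

```r
x <- c( "$35,090.00", "$71,400.00", "$36,000.00", "$64,000.00", "$20,800.00",
        "$107,195.00", "$147,225.00", "$98,426.00", "$89,406.00", "$95,095.00" )

# strip the dollar signs and commas, then convert to numeric
x.num <- as.numeric( gsub( "[$,]", "", x ) )

# sorts by magnitude, so the six-figure salaries now come first
sort( x.num, decreasing=TRUE )
```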

top_salary_table <- function( d )
{
  d.top5 <-
    d %>%
    select( Full.Name, salary, etc ) %>% 
    arrange( desc(salary) ) %>%
    slice(1:5)

  # add formatting to numeric vector 
  d.top5$salary <- dollarize( d.top5$salary )

  return( d.top5 )
}

Note that dollarize is expecting a numeric vector and will return a character vector. You were trying to pass a data frame, not a vector in your example above:

dollarize( topsalary )

lecy commented 2 years ago

Keeping track of data types is a small thing and also the biggest thing!

A reminder that it is a very good habit to use consistent object names so that you are always aware of what object types you are working with.

# data frames 
d
df
dat

# subset of a data frame
d2
d.male
dat.top5

# vectors (argument names) 
x, y
v
x.string <- as.character(x)
x.num <- as.numeric(x)

# logical vectors (defining groups)
these
these.female <- gender == "female"

# tables 
t <- table( x, y )
t.salary.by.gender <- d %>% group_by( gender ) %>% summarize( ave=mean(salary) )

# regression models 
m
m1
m2

For example, when I look back at functions I can see right away what sort of data they expect and what I get back.

# expecting a vector, returning a vector 
dollarize <- function(x)
{ 
  x <- paste0("$", format( round( x, 0 ), big.mark="," ) ) 
  return(x)
}

# proper usage
d$salary <- dollarize( d$salary )

# improper use
d <- dollarize( d)

# expecting data frame, returning table 
create_table <- function( d )
{
  t <- d %>% count( race, age )
  return(  t )
}

Once you are working with a larger project like this you can see the importance of good housekeeping / organizational skills.

Data types are not trivial. Creating a habit of naming objects by data type will force you to be aware of your objects as well as pay attention to proper argument types and return values when creating functions.
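
One hedged way to make those expectations explicit is to have a function check its input before doing any work. The wrapper below, dollarize_checked, is a hypothetical name and not part of the assignment; it uses the same formatting as dollarize above and simply stops early when it is handed something other than a numeric vector.

```r
# expecting a numeric vector, returning a character vector
dollarize_checked <- function( x )
{
  # fail early if a data frame or a string sneaks in
  stopifnot( is.numeric( x ) )
  x.string <- paste0( "$", format( round( x, 0 ), big.mark="," ) )
  return( x.string )
}

# hypothetical toy data
d <- data.frame( salary = c( 35090, 147225 ) )

salary.string <- dollarize_checked( d$salary )   # proper use: a vector
class( salary.string )

# improper use: a data frame triggers an error instead of odd output
res <- try( dollarize_checked( d ), silent=TRUE )
class( res )
```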

sandralili commented 2 years ago

Hello Dr. @lecy

This question is for the "Put it All Together" part inside Step 10. This is my loop for isolating data from each unit, to create the graphic and the 2 tables.


# STEP 10c - Reporting by each department

for( i in academic.units )
{
  cat( paste('<h1>', i ,'</h1>' ) )

  d5 %>%  
    filter( Department.Description == i ) %>% 
    create_salary_table() %>% 
    build_graph( unit=i )  

### PAY RANGE BY RANK AND GENDER  

  cat( paste('<h3>', 'PAY RANGE BY RANK AND GENDER' ,'</h3>' ) )

  t.salary <- 
  d5 %>% 
    filter( Department.Description == i ) %>% 
    create_salary_table()

  cat( t.salary %>% knitr::kable(format="html") )

  cat( paste('<h3>', 'PAY RANGE BY RANK AND GENDER' ,'</h3>' ) )

### TOP FIVE SALARIES

  cat( paste('<h3>', 'TOP FIVE SALARIES' ,'</h3>' ) )
  t.salary <- 
  d5 %>% 
    filter( Department.Description == i ) %>% 
    top_salaries()
    dollarize()

  cat( t.salary %>% knitr::kable(format="html") )

  cat( paste('<h3>', 'PAY RANGE BY RANK AND GENDER' ,'</h3>' ) )

  cat( '<br><hr><br>' )

 }  

This is the error I am getting:

Error: Problem with summarise() column q25.
i q25 = quantile(salary, 0.25).
x non-numeric argument to binary operator
i The error occurred in group 1: title = Full Professor, gender = "female".
Run rlang::last_error() to see where the error occurred.

Also, this chunk of code will be inside of the salary-report.rmd file, correct? Thanks!

lecy commented 2 years ago

Maybe trying to dollarize a data frame?

 t.salary <- 
  d5 %>% 
    filter( Department.Description == i ) %>% 
    top_salaries()
    dollarize()

dollarize() should be applied to a vector, not a data frame:

# add formatting to numeric vector 
  d.top5$salary <- dollarize( d.top5$salary )

Otherwise it looks mostly fine. Do you know which unit it gets stuck on? Does it fail the first time through the loop (the code is not working at all), or does it run fine until it hits a specific unit?

Yes, the loop will be inside the salary-report.rmd doc as well.

lecy commented 2 years ago

Error: Problem with summarise() column q25.
i q25 = quantile(salary, 0.25).
x non-numeric argument to binary operator

That would only occur in this function:

t.salary <- 
  d5 %>% 
    filter( Department.Description == i ) %>% 
    create_salary_table()

For more insight it would be helpful to share a peek at d5 and the function.

sandralili commented 2 years ago

Dr. @lecy I attempted to run the code with the loop before using the batch.R file, to fix any mistakes. It did not generate any graphic or tables, so the code is not working at all :(

Before adding the loop, these functions were working well:


# STEP 10a - Getting top five salaries

top_salaries <- function (d5)
{

 higher <- d5 [order(-d5$salary),]
 return(higher)
}

higher <- top_salaries (d5)
d5 <- higher

head (d5[c(3,4,10,12)],5) %>% pander()

# STEP 10b - Changing from numeric to "currency" type

dollarize <- function(x)
{ 

  paste0("$", format( round( x, 0 ), big.mark="," ) ) 

}

d5$salary <- dollarize ( x=d5$salary )

head (d5[c(3,4,10,12)],5) %>% pander()

image

These are the columns on d5:

colnames(d5)
[1] "first.name" "Calendar.Year" "Full.Name" "Job.Description" "Department.Description" "Salary"
[7] "FTE" "proportion_male" "proportion_female" "gender" "title" "salary"

sandralili commented 2 years ago

Good Morning Dr. @lecy,

I am running my code without the batch file first, to see if it works. When R is running the utils.R file to call the functions, I got the following error:

image

Not sure what's going on. I had created the required functions and I have some code that is not in a function. But I have done it in previous rows and it looks like I don't have any problem at all.

(this is code in salary-report)

d5 <- 
  d4 %>% 
  filter( title != "" & ! is.na(title) ) %>% 
  filter( Department.Description %in% academic.units ) %>% 
  arrange( Department.Description, title )

# STEP 8 - Summarizing Salaries
t.salary <- create_salary_table(d5)

# STEP 9 - Creating Graphics with results

# STEP 10a - Getting top five salaries 
top_salaries  (d5)

# STEP 10b - Changing from numeric to "currency" type
d5$salary <- dollarize ( x=d5$salary )

This is in utils.R:
## top_salaries ()
top_salaries <- function (d5)
{

  higher <- d5 [order(-d5$salary),]
  return(higher)
}

higher <- top_salaries (d5)
d5 <- higher

head (d5[c(3,4,10,12)],5) %>% pander()

lecy commented 2 years ago

You source utils.R at the start of the RMD. Sourcing a file runs the script from top to bottom.

When it gets to these lines it will not know where to find d5, since d5 has not been created yet and you are not sending any data to utils.

# This is in utils.R:
higher <- top_salaries (d5)
d5 <- higher

It should be strictly functions in utils. Execute the functions in the RMD to wrangle the data there.
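
A minimal sketch of that split. So the example can run on its own, it writes a stand-in utils.R to a temporary file; in the project the file already exists. The toy data values are made up.

```r
# stand-in for utils.R: function definitions only, no calls that need data
utils.path <- tempfile( fileext=".R" )
writeLines( c(
  "top_salaries <- function( d )",
  "{",
  "  d <- d[ order( -d$salary ), ]",
  "  return( head( d, 5 ) )",
  "}"
), utils.path )

# in the RMD: sourcing only defines the function, nothing is run on data
source( utils.path )

# in the RMD: build the data first, then call the function on it
d5 <- data.frame(
  Full.Name = c( "A", "B", "C" ),
  salary    = c( 80000, 150000, 120000 )
)
d.top5 <- top_salaries( d5 )
d.top5
```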

lecy commented 2 years ago

These sections should be bundled into one function:


# top_salaries <- function (d5)
# {
#   
#   higher <- d5 [order(-d5$salary),]
#   return(higher)
# }
# 
# head (d5[c(3,4,10,12)],5) %>% pander()

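top_salaries <- function( d )
{
  # sort by numeric salary, highest first
  d <- d[ order( -d$salary ), ]
  # keep the display columns and the top five rows
  d.top5 <- head( d[ c(3,4,10,12) ], 5 )
  return( d.top5 )
}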

# USE: 
# top_salaries (d5) %>% pander()

Also note that you will likely run into issues using relative column position references since you are adding columns to the data frame. If you change the order of any steps you might end up selecting the wrong columns.

If you select by names and the order changes it won't matter.

d[ c("Full.Name","salary",...) ]  # don't need a comma when selecting column names 
select( d, Full.Name, salary, etc. )  # dplyr version 

sandralili commented 2 years ago

It worked, thank you! Last night I forgot to delete those lines of code. Oh I see, I will replace the column numbers with their names. Thanks.

For the "Reporting by each department" loop, I am getting this error. I am not sure if I am using the curly brackets correctly, or maybe I am missing the "else" function?

image

# STEP 10c - Reporting by each department

for( i in academic.units )
{

  cat( paste('<h1>', i ,'</h1>' ) )

  d5 %>%  
    filter( Department.Description == i ) %>% 
    if( nrow(d5) == 0 ) 
    {   next }
       create_salary_table() %>% 
       build_graph( unit=i )  

### PAY RANGE BY RANK AND GENDER  

  cat( paste('<h3>', 'PAY RANGE BY RANK AND GENDER' ,'</h3>' ) )

  t.salary <- 
  d5 %>% 
    filter( Department.Description == i ) %>% 
    create_salary_table()

  cat( t.salary %>% knitr::kable(format="html") )

  cat( paste('<h3>', 'PAY RANGE BY RANK AND GENDER' ,'</h3>' ) )

### TOP FIVE SALARIES

  cat( paste('<h3>', 'TOP FIVE SALARIES' ,'</h3>' ) )
  t.salary <- 
  d5 %>% 
    filter( Department.Description == i ) %>% 
    top_salaries()
    dollarize()

  cat( t.salary %>% knitr::kable(format="html") )

  cat( paste('<h3>', 'PAY RANGE BY RANK AND GENDER' ,'</h3>' ) )

  cat( '<br><hr><br>' )
}

lecy commented 2 years ago

You can't pipe a data frame to an if statement. Check the data frame after selecting by unit.

for( i in academic.units )
{

  d6 <- 
    d5 %>%  
    filter( Department.Description == i ) 

    if( nrow(d6) == 0 ) 
    {   next }

  d6 %>% 
       create_salary_table() %>% 
       build_graph( unit=i )  

... }

sandralili commented 2 years ago

Thanks, it worked!

sandralili commented 2 years ago

Dr. @lecy My salary-report file is just printing this code, it is not printing the tables or the graphic.


# STEP 10c - Reporting by each department

for( i in academic.units )
{

  d6 <- 
    d5 %>%  
    filter( Department.Description == i ) 

    if( nrow(d6) == 0 ) 
    { 
      next 
    }

    d6 %>% 
       create_salary_table() %>% 
       build_graph( unit=i )  

### PAY RANGE BY RANK AND GENDER  

  cat( paste('<h3>', 'PAY RANGE BY RANK AND GENDER' ,'</h3>' ) )

  t.salary <- 
  d5 %>% 
    filter( Department.Description == i ) %>% 
    create_salary_table()

  cat( t.salary %>% knitr::kable(format="html") )

  cat( paste('<h3>', 'PAY RANGE BY RANK AND GENDER' ,'</h3>' ) )

### TOP FIVE SALARIES

  cat( paste('<h3>', 'TOP FIVE SALARIES' ,'</h3>' ) )
  t.salary <- 
  d5 %>% 
    filter( Department.Description == i ) %>% 
    top_salaries()
    dollarize()

  cat( t.salary %>% knitr::kable(format="html") )

  cat( paste('<h3>', 'PAY RANGE BY RANK AND GENDER' ,'</h3>' ) )

  cat( '<br><hr><br>' )
}

# pretty tables 

<style>

td {
    padding: 3px 15px 3px 15px;
    text-align: center !important;
}

th {
    padding: 3px 15px 3px 15px;
    text-align: center !important;
    font-weight: bold;
    color: SlateGray !important;
}

</style>

The d5 is not empty, not sure if I placed the "results='asis'" in the right place

image image

lecy commented 2 years ago

My salary-report file is just printing this code, it is not printing the tables or the graphic.

After knitting? Or are you previewing in the RMD doc?

lecy commented 2 years ago

Try knitting to see how it looks.

sandralili commented 2 years ago

Dr. @lecy, After knitting. The report is empty, it is very weird. This is all that I got:

image

image

lecy commented 2 years ago

What are your chunk arguments?

sandralili commented 2 years ago

Dr @lecy , I know I still need to customize the yaml, and the URL part for the batch file. This is what I have:

---
title: "salary-report.rmd"
author: "Sandra Perez"
date: "10/8/2021"
output:
  html_document:
    df_print: paged
    theme: cerulean
    highlight: haddock
---

knitr::opts_chunk$set(echo = TRUE)

# STEP 1 - Getting data from site.

source( 'utils.R' )   # load custom functions

URL <- 'https://docs.google.com/spreadsheets/d/1RoiO9bfpbXowprWdZrgtYXG9_WuK3NFemwlvDGdym7E/export?gid=1335284952&format=csv'
d <- read.csv( URL )

head( d) %>% pander()

lecy commented 2 years ago

I meant the chunk arguments for the code chunk with the loop.

Chunk arguments:

{r, include=FALSE}

Here's what your YAML should be:

---
title: "ASU Salary Report"
author: "Sandra Perez"
date: "10/8/2021"
output:
  html_document:
    df_print: paged
    theme: cerulean
    highlight: haddock
    toc: true
params:
  url:
    value: x
---

lecy commented 2 years ago

Loop chunk arguments should be:

{r, fig.height=7, fig.width=10, results="asis"}
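
For context, a small sketch of what results="asis" does: anything the chunk prints with cat() is passed through as raw HTML instead of being wrapped as console output. The toy table is made up, and the snippet assumes knitr is installed; it is meant to sit inside a chunk with the arguments above.

```r
# hypothetical toy table standing in for t.salary
t.salary <- data.frame(
  title = c( "Full Professor", "Associate Professor" ),
  q50   = c( 150000, 110000 )
)

# kable() returns the HTML as text; cat() writes it into the document as-is
html.table <- knitr::kable( t.salary, format="html" )

cat( "<h3>PAY RANGE BY RANK</h3>\n" )
cat( html.table, sep="\n" )
```

Without results="asis" the same cat() calls would show up in the knitted report as literal HTML text rather than as a rendered heading and table.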

sandralili commented 2 years ago

Dr. @lecy I did have that in the loop chunk, and I fixed the Yaml. It's still not working

{results=‘asis’, fig.height=7, fig.width=10}
# STEP 10c - Reporting by each department

for( i in academic.units )
{

  d6 <- 
    d5 %>%  
    filter( Department.Description == i ) 

    if( nrow(d6) == 0 ) 
    { 
      next 
....

lecy commented 2 years ago

Do you have illegal quote marks here? I can't tell.

‘asis’

lecy commented 2 years ago

One small request, please place fences around your code so it is readable:

fences

image

sandralili commented 2 years ago

My apologies professor @lecy


# STEP 10c - Reporting by each department

for( i in academic.units )
{

  d6 <- 
    d5 %>%  
    filter( Department.Description == i ) 

    if( nrow(d6) == 0 ) 
    { 
      next 
    }

    d6 %>% 
       create_salary_table() %>% 
       build_graph( unit=i )  

image

sandralili commented 2 years ago

Dr. @lecy, yes, when I run the code without knitting this is the message:

'"results=`asis'"' is not recognized as an internal or external command, operable program or batch file.

lecy commented 2 years ago

What I meant is you are using the wrong quote marks around the argument.

Do you see the difference here?

results=`asis'  # wrong
results='asis'  # right 

It's one of the reasons it's important to use code formatting - you can't spot these things easily as regular text, and you can't copy and paste regular text into code files because they can contain these stylized typesetting characters that break your code.
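
A rough way to catch these characters programmatically, sketched in base R. The curly quotes are typed with \u escapes so the example itself stays plain ASCII, and has_bad_quotes is a hypothetical helper name.

```r
# TRUE if a string contains non-ASCII characters (e.g. curly quotes)
# or a backtick used in place of a straight quote
has_bad_quotes <- function( x.string )
{
  non.ascii <- any( utf8ToInt( x.string ) > 127 )
  backtick  <- grepl( "`", x.string, fixed=TRUE )
  return( non.ascii | backtick )
}

has_bad_quotes( "results=\u2018asis\u2019" )   # curly quotes
has_bad_quotes( "results=`asis'" )             # backtick
has_bad_quotes( "results='asis'" )             # straight quotes
```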

sandralili commented 2 years ago

Dr @lecy Thanks I would not have seen it even with glasses :)

I do apologize for so many errors I am getting. Once I fixed the 'asis', this is what I got now. I feel I am going backwards :( I got this message when I didn't have the yaml fixed.

image

These are the traceback steps (when I run the code without knitting):

Error: Problem with summarise() column q25.
i q25 = quantile(salary, 0.25).
x non-numeric argument to binary operator
i The error occurred in group 1: title = Full Professor, gender = "female".
Run rlang::last_error() to see where the error occurred.

19. stop(fallback)
18. signal_abort(cnd)
17. abort(bullets, class = "dplyr_error")
16. h(simpleError(msg, call))
15. .handleSimpleError(function (e) { local_call_step(dots = dots, .index = i, .fn = "summarise", .dot_data = inherits(e, "rlang_error_data_pronoun_not_found")) ...
14. quantile.default(salary, 0.25)
13. quantile(salary, 0.25)
12. mask$eval_all_summarise(quo)
11. withCallingHandlers({ for (i in seq_along(dots)) { mask$across_cache_reset() context_poke("column", old_current_column) ...
10. summarise_cols(.data, ..., caller_env = caller_env())
9. summarise.grouped_df(., q25 = quantile(salary, 0.25), q50 = quantile(salary, 0.5), q75 = quantile(salary, 0.75), n = n())
8. summarize(., q25 = quantile(salary, 0.25), q50 = quantile(salary, 0.5), q75 = quantile(salary, 0.75), n = n())
7. ungroup(.)
6. mutate(., p = round(n/sum(n), 2))
5. d5 %>% filter(!is.na(title) & title != "") %>% group_by(title, gender) %>% summarize(q25 = quantile(salary, 0.25), q50 = quantile(salary, 0.5), q75 = quantile(salary, 0.75), n = n()) %>% ungroup() %>% mutate(p = round(n/sum(n), 2)) at utils.R#91
4. create_salary_table(.) at utils.R#191
3. unique(t.salary$title) at utils.R#191
2. build_graph(., unit = i)
1. d5 %>% create_salary_table() %>% build_graph(unit = i)

lecy commented 2 years ago

Check the data type for salary prior to creating tables. It should be numeric.
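
A minimal sketch of that check, with made-up toy data. It also shows one way to avoid the problem: keep the numeric salary column untouched and store the formatted version under a separate name (salary.label here is a hypothetical column name), so tables built later still see numbers.

```r
# hypothetical toy data
d5 <- data.frame(
  title  = c( "Full Professor", "Full Professor", "Lecturer" ),
  salary = c( 150000, 130000, 60000 )
)

# check the type before building any tables
class( d5$salary )
stopifnot( is.numeric( d5$salary ) )

# numeric summaries work because salary is still numeric
q50 <- median( d5$salary )

# format a copy for display instead of overwriting the numeric column
d5$salary.label <- paste0( "$", format( d5$salary, big.mark="," ) )

is.numeric( d5$salary )          # unchanged, still safe for quantile()
is.character( d5$salary.label )  # display version
```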

aawoods97 commented 2 years ago

I am able to create a table that looks like the one in the instructions. However, I am not able to get my table to sort by salary. What can I do to fix this?

top5 <- function( d ) {
  d.top5 <-
    d %>%
    select( Full.Name, gender, Job.Description, Salary ) %>% 
    arrange( d, desc(Salary) ) %>%
    slice(1:5)

  # add formatting to numeric vector 
  d.top5$Salary <- dollarize( d.top5$Salary )

  return( d.top5 )
 }

top5(d)
Screen Shot 2021-10-08 at 9 00 17 PM

(Also my Salary is incorrect, but I am working on that too)

lecy commented 2 years ago

Prior to running the function what class is d$Salary?

In the examples I left d$Salary as was in the raw data (a string) and created a new variable d$salary that is numeric.

Should this be?

# select( Full.Name, gender, Job.Description, Salary )
select( Full.Name, gender, Job.Description, salary )

lecy commented 2 years ago

See: https://github.com/Watts-College/cpp-527-fall-2021/issues/87#issuecomment-939195898

aawoods97 commented 2 years ago

The class is numeric

sandralili commented 2 years ago

Dr. @lecy, as you mentioned, for some reason my function was not converting the character values to numbers. I am trying to fix it, thanks!

d4$Salary was character and I used a function to add a column d4$salary (and this one was supposed to be numeric).

lecy commented 2 years ago

@aawoods97 d$Salary or d$salary is numeric?

aawoods97 commented 2 years ago

d$Salary. I don't have a salary with a lowercase s

lecy commented 2 years ago

Here's the problem - you are sending two data frames to arrange. One through piping and one as an argument:

# arrange( d, desc(Salary) )

  d.top5 <-
    d %>%
    select( Full.Name, gender, Job.Description, Salary ) %>% 
    arrange( desc(Salary) ) %>%
    slice(1:5)

aawoods97 commented 2 years ago

Thank you! This worked for me

sandralili commented 2 years ago

Dr. @lecy, it worked! Finally. Thanks so much!!! It looks like the dollarize function was being called before creating the t.salary table.

image

This is t.salary: image