Watts-College / cpp-527-fall-2021

A course shell for CPP 527 Foundations of Data Science II
https://watts-college.github.io/cpp-527-fall-2021/

Final Lab - Batch Processing #74

Open AhmedRashwanASU opened 2 years ago

AhmedRashwanASU commented 2 years ago

Good day Prof,

Please note that the salary files for the other years include an extra column, X, containing some notes, and that FTE is calculated differently than in the 2020 data. After I assign NULL to column X in order to delete it, running the first function, which extracts first names, returns null results.

Any idea how to solve this?

lecy commented 2 years ago

Which year are you looking at specifically?

AhmedRashwanASU commented 2 years ago

2019, 2018, 2017, and 2016

lecy commented 2 years ago

You can drop the note (column X). It would be better to specify which columns to keep because that would be consistent across years, but for demo purposes:

> URL <- 'https://docs.google.com/spreadsheets/d/1RoiO9bfpbXowprWdZrgtYXG9_WuK3NFemwlvDGdym7E/export?gid=1948400967&format=csv'
> d <- read.csv( URL )
> d2 <- dplyr::select( d, -X )
> 
> head(d)
  Calendar.Year            Full.Name            Job.Description
1          2019      Abadjivor,Enyah                Coordinator
2          2019          Abbas,James            Assoc Professor
3          2019 Abbaszadegan,Morteza                  Professor
4          2019           Abbe,Scott Tech Support Analyst Coord
5          2019           Abbl,Norma           Sr HR Consultant
6          2019         Abbott,David            Assoc Professor
          Department.Description      Salary FTE
1      Research Division 2 Tempe  $45,195.00 100
2 Sch Biological & Hlth Sys Engr $101,795.00 100
3 Sch Sustain Engr & Built Envrn $143,625.00 100
4 Engineering Technical Services  $95,560.00 100
5                    HR Partners  $86,806.00 100
6                          Shesc  $86,188.00 100
                                                                                                                                                                                                                                                                   X
1 NOTE: This data is public record salary data of Arizona State University employees, compiled by The State Press. Last updated Dec. 5, 2017. View the numbers in a searchable database at http://www.statepress.com/article/2017/04/spinvestigative-salary-database
2                                                                                                                                                                                                                                                                   
3                                                                                                                                                                                                                                                                   
4                                                                                                                                                                                                                                                                   
5                                                                                                                                                                                                                                                                   
6                                                                                                                                                                                                                                                                   
> 
> head(d2)
  Calendar.Year            Full.Name            Job.Description
1          2019      Abadjivor,Enyah                Coordinator
2          2019          Abbas,James            Assoc Professor
3          2019 Abbaszadegan,Morteza                  Professor
4          2019           Abbe,Scott Tech Support Analyst Coord
5          2019           Abbl,Norma           Sr HR Consultant
6          2019         Abbott,David            Assoc Professor
          Department.Description      Salary FTE
1      Research Division 2 Tempe  $45,195.00 100
2 Sch Biological & Hlth Sys Engr $101,795.00 100
3 Sch Sustain Engr & Built Envrn $143,625.00 100
4 Engineering Technical Services  $95,560.00 100
5                    HR Partners  $86,806.00 100
6                          Shesc  $86,188.00 100
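
A minimal sketch of the "specify which columns to keep" approach mentioned above. It assumes the six column names shown in the 2019 output also appear in the other years' files, which is not verified here. The toy data frame stands in for `read.csv( URL )`, and `keep.cols` is a hypothetical name.

```r
# toy stand-in for d <- read.csv( URL ), built from two rows of the output above
d <- data.frame(
  Calendar.Year          = c( 2019, 2019 ),
  Full.Name              = c( "Abadjivor,Enyah", "Abbas,James" ),
  Job.Description        = c( "Coordinator", "Assoc Professor" ),
  Department.Description = c( "Research Division 2 Tempe",
                              "Sch Biological & Hlth Sys Engr" ),
  Salary                 = c( "$45,195.00", "$101,795.00" ),
  FTE                    = c( 100, 100 ),
  X                      = c( "NOTE: (note text shortened)", "" ),
  stringsAsFactors = FALSE )

# assumption: these names are identical in every year's file
keep.cols <- c( "Calendar.Year", "Full.Name", "Job.Description",
                "Department.Description", "Salary", "FTE" )

# keep only the named columns; X and any other extras are dropped
d2 <- d[ , keep.cols ]
names( d2 )
```

Selecting by name fails loudly if a year is missing one of the expected columns, which may be preferable to silently carrying an unexpected layout forward.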

Full-Time Equivalency is scaled differently, max value of either 1 or 100.

You might need to add some conditionality to your normalization function.

if( max(FTE) == 100 )
{  salary <- salary / (FTE/100) }

if( max(FTE) == 1 )
{ salary <- salary / FTE }

Does that make sense to you?
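
One way the conditional above could be wrapped into a function, as a sketch only. `normalize_salary` is a hypothetical name, and the rule "a maximum FTE above 1 means the column is a percentage" is an assumption that would misfire on a percentage-scaled file where nobody exceeds 1 percent.

```r
# Hypothetical helper: convert salary strings to numbers and scale them
# to a full-time basis, whether FTE is stored as 0-1 or 0-100.
normalize_salary <- function( salary, fte )
{
  # strip "$" and "," so "$45,195.00" becomes 45195
  salary <- as.numeric( gsub( "[$,]", "", salary ) )
  fte    <- as.numeric( fte )

  # assumption: a max above 1 means FTE is a percentage, so rescale to 0-1
  if( max( fte, na.rm=TRUE ) > 1 ) { fte <- fte / 100 }

  # an FTE of zero cannot be scaled, so treat it as missing
  fte[ which( fte == 0 ) ] <- NA

  salary / fte
}

# the same half-time employee under both scales
normalize_salary( c( "$45,195.00", "$50,000.00" ), c( 100, 50 ) )   # 45195 100000
normalize_salary( c( "$45,195.00", "$50,000.00" ), c( 1, 0.5 ) )    # 45195 100000
```

Testing for `> 1` rather than `== 100` keeps the function working on a subset of the data in which no one happens to be at exactly 100 percent.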

AhmedRashwanASU commented 2 years ago

Yup, will apply the same. Thanks, Prof.

lecy commented 2 years ago

The joy of being a data analyst is that the world is marching toward entropy and your job is to create order and meaning from the chaos ;-)

AhmedRashwanASU commented 2 years ago

Step 2

Just to confirm: this function worked on the 2020 data, but it returns only NA values on the 2019 data. Any idea why? Note that the code below is only meant to test the main function.


name.first <- sapply(strsplit(d2$Full.Name, " "), `[`, 2)

head(name.first)

[1] NA NA NA NA NA NA

The 2020 data returns the following:

"Mohammad" "Jose" "Kelsea" "Enyah" "Precious" "James"

@lecy

lecy commented 2 years ago

Can you see what changed between 2019 and 2020? What should you use as the delimiter instead of a space?

### 2019 DATA
 [65] "Adelman,Madelaine"        "Adler,Patricia"          
 [67] "Adrian,Ronald"            "Adusumilli,Sesha Chandra"
 [69] "Afanador Pujol,Angelica"  "Affolter,Jacob"          
 [71] "Afsari Mamaghani,Sepideh" "Aganaba,Timiebi"     

### 2020 DATA
[1] "ABBASI, Mohammad"                    "ARQUIZA, Jose Maria Reynaldo Apollo"
[3] "Aaberg, Kelsea"                      "Abadjivor, Enyah"                   
[5] "Abayesu, Precious"        

Note that your heuristic above will fail when there are two last names:

"Afanador Pujol, Angelica"

It also won't return a single first name when the string includes middle names:

"ARQUIZA, Jose Maria Reynaldo Apollo"

See some hints here: https://github.com/Watts-College/cpp-527-fall-2021/issues/67#issuecomment-937135609
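
The two failure cases above suggest splitting on the comma instead of the space, then keeping only the first word that follows. A minimal sketch under that assumption is below. `get_first_name` is a hypothetical name, not the function from the lab, and a string with no comma simply returns its first word.

```r
# Hypothetical helper: pull a single first name out of "Last,First Middle"
# strings, with or without a space after the comma.
get_first_name <- function( full.name )
{
  # drop everything up to and including the first comma
  after.comma <- sub( "^[^,]*,", "", full.name )

  # remove leading and trailing spaces (2020 data has ", " as the delimiter)
  after.comma <- trimws( after.comma )

  # keep only the first word, dropping any middle names
  sub( " .*$", "", after.comma )
}

x <- c( "Afanador Pujol,Angelica",               # two last names, 2019 style
        "ARQUIZA, Jose Maria Reynaldo Apollo",   # middle names, 2020 style
        "Abbas,James" )

get_first_name( x )
# [1] "Angelica" "Jose"     "James"
```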

Just make sure you don't leave a space in front of the first name or the gender package will fail to match the name.

" Jose"  # no return value from gender package when there is a leading space

RachNicely commented 2 years ago

@lecy I am still getting an error message when I try to run the graph with the 2019 data, but I think I've figured out where it might be coming from. I am not able to generate the graph for some units in 2019.

build_graph( t.salary, unit="Ldrshp and Integrative Studies" ) ## Does not run for 2019

It looks like this Department.Description does not exist in the 2019 dataset, so when I run through the academic.units provided, the code searches for something that doesn't exist. I'm not sure how to adapt academic.units for each year, since we only want a subset of the department descriptions.

2020:

image

2019:

image

lecy commented 2 years ago

Here's a great use of control structures.

Rule: if the academic unit does not exist in the dataset then skip it:

for( i in academic.units )
{

  d2 <- filter( d, Department.Description == i )
  if ( nrow(d2) == 0 ) { next }  # skips the rest of the code in the loop for this department
  ...

}
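
A self-contained sketch of the skip rule above, using base R subsetting in place of `filter()` so it runs without loading dplyr. The toy data frame is made up for illustration (one salary value is invented), and `processed` is a hypothetical variable used only to show which units ran.

```r
# toy stand-in for one year's salary data; the 90000 value is invented
d <- data.frame(
  Department.Description = c( "Shesc", "Shesc", "HR Partners" ),
  Salary                 = c( 86188, 90000, 86806 ),
  stringsAsFactors = FALSE )

# the second unit does not exist in this toy dataset
academic.units <- c( "Shesc", "Ldrshp and Integrative Studies" )

processed <- character( 0 )

for( i in academic.units )
{
  d2 <- d[ d$Department.Description == i, ]

  # nothing to plot for this unit in this year, so move to the next one
  if( nrow( d2 ) == 0 ) { next }

  # the graph-building step would go here
  processed <- c( processed, i )
}

processed
# [1] "Shesc"
```

An alternative with the same effect is to trim the list before the loop, for example `academic.units[ academic.units %in% d$Department.Description ]`.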

RachNicely commented 2 years ago

Thank you, Dr. Lecy! That was the final piece of the puzzle.

lecy commented 2 years ago

Thanks for identifying the problem department.

I added that tip to the instructions for others as well.

AhmedRashwanASU commented 2 years ago

@lecy I'm not sure whether the 2019 report will run if I load the 2019 data with the edit URL instead of the export link:

#####  BATCH.R FILE      

## 2020 REPORT

url.2020 <- "https://docs.google.com/spreadsheets/d/1RoiO9bfpbXowprWdZrgtYXG9_WuK3NFemwlvDGdym7E/export?gid=1335284952&format=csv"
rmarkdown::render( input='salary-report.rmd', 
                   output_file = "ASU-2020-Salary-Report.HTML",
                   params = list( url = url.2020 ) )

## 2019 REPORT 

url.2019 <- "https://docs.google.com/spreadsheets/d/1RoiO9bfpbXowprWdZrgtYXG9_WuK3NFemwlvDGdym7E/edit#gid=1948400967"
rmarkdown::render( input='salary-report.rmd', 
                   output_file = "ASU-2019-Salary-Report.HTML",
                   params = list( url = url.2019 ) )

lecy commented 2 years ago

No, it needs to be converted to the same format as the 2020 data (export CSV version):

> url.2019 <- "https://docs.google.com/spreadsheets/d/1RoiO9bfpbXowprWdZrgtYXG9_WuK3NFemwlvDGdym7E/edit#gid=1948400967"
> d <- read.csv( "url.2019" )
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'url.2019': No such file or directory
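
A sketch of the conversion, assuming the export link always has the form used for the 2019 and 2020 files earlier in this thread (`/export?gid=...&format=csv`). `to_export_url` is a hypothetical helper, not part of the lab code. Note also that the error above names a file called 'url.2019' because the variable name was passed in quotes; the object itself should be passed to `read.csv()`.

```r
# Hypothetical helper: turn a Google Sheets "edit" link into a CSV "export" link.
to_export_url <- function( url )
{
  # pull out the tab id that follows "gid="
  gid  <- sub( "^.*gid=([0-9]+).*$", "\\1", url )

  # keep everything before "/edit", i.e. the path ending in the document id
  base <- sub( "/edit.*$", "", url )

  paste0( base, "/export?gid=", gid, "&format=csv" )
}

url.edit <- "https://docs.google.com/spreadsheets/d/1RoiO9bfpbXowprWdZrgtYXG9_WuK3NFemwlvDGdym7E/edit#gid=1948400967"
url.2019 <- to_export_url( url.edit )
url.2019

# pass the variable itself, not the quoted string "url.2019":
# d <- read.csv( url.2019 )
```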