Open kimnewzealand opened 6 years ago
This RMD is a visual investigation of the intern dataset. The intern dataset has 214 observations of 13 variables. There are two main areas of investigation:
Equivalent hourly rate (in the case of yearly rates calculated using rate/(52*40)) compared to Intern Gender, Programme, Region, Internship Type, Intern Ethnicity, Institution and Offer Created at. The relationship between hourly rate, gender and other variables is investigated. An interesting question is, do female interns receive an hourly rate less than male interns? Within this dataset it seems the only variable which has an effect on rate per hour is the internship type.
Investigates relationships between Internship Type, Programme and Institution. Have used the FactoMineR package, a good tutorial is here http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/113-ca-correspondence-analysis-in-r-essentials/
These are the libraries used in this file.
library(plyr)
library(knitr)
library(tidyverse)
library(FactoMineR)
library("factoextra")
library(gridExtra)
library(grid)
library(scales)
Only the intern dataset is imported for this RMD file.
intern <- read.csv("sotdata-master\\sotdata-master\\Intern data Dec 2017.csv")
Calculate a hourly equivalent rate if the rate is of type year.
intern$rate_hour <- ifelse(intern$Rate.period=="hour", intern$Rate.value, intern$Rate.value/(52*40))
Median rate for male and female is similar, but there is more spread for female.
# Density plots with semi-transparent fill
intern.mf <- subset(intern, Intern.gender == "female" | Intern.gender == "male")
ggplot(intern.mf, aes(x=Intern.gender, y=rate_hour)) +
geom_boxplot() +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
xlab("Gender")
This is a denisty plot so based on proportion. It seems that there are more females with low hourly rates.
ggplot(intern.mf, aes(x=rate_hour, fill=Intern.gender)) +
geom_density(alpha=.3)
If we look at absolute values, the numbers of males and females at the lower rates are similar, but there are less females at higher hourly rates.
ggplot(intern.mf, aes(x=rate_hour, fill=Intern.gender)) +
geom_histogram(binwidth=2, alpha=.3, position="identity")
Running a linear model on gender does not provide a significant result.
lm.gender <- lm(rate_hour ~ Intern.gender, intern.mf)
summary(lm.gender)
The most common category is pakeha. The data may benefit from rolling up into NZ Statistics standard categories. Interestingly for those that gender is undisclosed, ethnicity is undisclosed as well. The median rates are similar for the top three categories.
p1 <- ggplot(intern,
aes(x=reorder(Intern.ethnicity,
Intern.ethnicity,
function(x)-length(x)))) +
geom_bar(stat="count") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
xlab("Ethnicity (most common -> least common)") +
ylab("Count") +
ggtitle("Count of Interns by Ethnicity")
p2 <- ggplot(intern,
aes(x=reorder(Intern.ethnicity,Intern.ethnicity,function(x)-length(x)),
fill=Intern.gender)) +
geom_bar(position = "fill",stat = "count") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5),
legend.position="top") +
xlab("Ethnicity (most common -> least common)") +
ylab("Proportion") +
ggtitle("Gender of Interns v Ethnicity")
p3 <- ggplot(intern,
aes(x=reorder(Intern.ethnicity,Intern.ethnicity,function(x)-length(x)), y=rate_hour)) +
geom_boxplot() +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
xlab("Ethnicity (most common -> least common)") +
ylab("Rate per hour") +
ggtitle("Intern rates per hour by Ethnicity")
grid.arrange(p1, p2, p3, nrow = 3)
Coloured by gender, jitter has been added.
ggplot(intern, aes(x=reorder(Intern.ethnicity,Intern.ethnicity,function(x)-length(x)), y=rate_hour, color=Intern.gender)) +
geom_point(position = position_jitter(w = 0.05, h = 0.1), shape=1, size=2, stroke=0.8) + # Draw points
scale_color_manual(values=c("#D55E00", "#0072B2", "#F0E442", "#CC79A7")) +
geom_segment(aes(x=reorder(Intern.ethnicity,Intern.ethnicity,function(x)-length(x)),
xend=reorder(Intern.ethnicity,Intern.ethnicity,function(x)-length(x)),
y=min(rate_hour),
yend=max(rate_hour)),
linetype="dashed",
colour="black",
size=0.1) + # Draw dashed lines
labs(title="Dot Plot",
subtitle="Intern Ethnicity Vs Rate by Hour") +
xlab("Intern Ethnicity (most common -> least common)") +
ylab("Rate per hour") +
theme_classic() +
coord_flip()
Running a linear model on ethnicity provides a marginal result.
lm.ethnicity <- lm(rate_hour ~ Intern.ethnicity, intern)
summary(lm.ethnicity)
By far the most common internship type is software developer and of those males were more common than females.
p1 <- ggplot(intern,
aes(x=reorder(internship.type,internship.type,function(x)-length(x)))) +
geom_bar(stat="count") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
xlab("Internship Type (most common -> least common)") +
ylab("Count") +
ggtitle("Count of Interns by Internship Type")
p2 <- ggplot(intern,
aes(x=reorder(internship.type,internship.type,function(x)-length(x)),
fill=Intern.gender)) +
geom_bar(position = "fill",stat = "count") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5),
legend.position="top") +
xlab("Internship Type (most common -> least common)") +
ylab("Proportion") +
ggtitle("Gender of Interns by Internship Type")
p3 <- ggplot(intern,
aes(x=reorder(internship.type,internship.type,function(x)-length(x)), y=rate_hour)) +
geom_boxplot() +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
xlab("Internship Type (most common -> least common)") +
ylab("Rate per hour") +
ggtitle("Intern rates per hour by Internship type")
grid.arrange(p1, p2, p3, nrow = 3)
Because there are too many categories a simplified variable for internship type has been created. When controlling for internship type gender seems to have no effect on median rate.
intern <- intern %>%
mutate(internship.type.ch = as.character(internship.type)) %>%
mutate(internship.type.simplified = ifelse(internship.type.ch %in% c("software developer",
"design",
"marketing",
"IT consultant",
"hr",
"mobile developer",
"security",
"data analyst"),
internship.type.ch,
"Other")) %>%
mutate(internship.type.simplified = as.factor(internship.type.simplified)) %>%
select(-internship.type.ch)
ggplot(intern,
aes(x=reorder(internship.type.simplified,internship.type.simplified,function(x)-length(x)), y=rate_hour, fill=Intern.gender)) +
geom_boxplot() +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
xlab("Internship Type (most common -> least common)") +
ylab("Rate per hour") +
ggtitle("Boxplot of Intern rates per hour by Internship type and Gender")
Coloured by gender, jitter has been added.
ggplot(intern, aes(x=reorder(internship.type,internship.type,function(x)-length(x)), y=rate_hour, color=Intern.gender)) +
geom_point(position = position_jitter(w = 0.05, h = 0.1), shape=1, size=2, stroke = 0.8) + # Draw points
scale_color_manual(values=c("#D55E00", "#0072B2", "#F0E442", "#CC79A7")) +
geom_segment(aes(x=reorder(internship.type,internship.type,function(x)-length(x)),
xend=reorder(internship.type,internship.type,function(x)-length(x)),
y=min(rate_hour),
yend=max(rate_hour)),
linetype="dashed",
colour="black",
size=0.1) + # Draw dashed lines
labs(title="Dot Plot",
subtitle="Internship Type Vs Rate by Hour") +
xlab("Internship Type (most common -> least common)") +
theme_classic() +
coord_flip()
Running a linear model on internship type provides a significant result, but R-square is low at 16%.
lm.internship.type <- lm(rate_hour ~ internship.type, intern)
summary(lm.internship.type)
No it does not. An anova was used comparing a linear model of simplified intern type as the independent variable and rate as the dependant variable against a more complex model of simplified intern type and gender as the independent variables and rate as the dependant variable.
lm.intern.type.gender <- lm(rate_hour ~ internship.type.simplified + Intern.gender, intern.mf)
anova(lm.intern.type.s, lm.intern.type.gender)
Most interns are from Wellington and Auckland with similar rates of gender. Median hourly rates are similar between Auckland and Wellington but Wellington has longer right hand tail.
p1 <- ggplot(intern, aes(x=reorder(Region,Region,function(x)-length(x)))) +
geom_bar(stat="count") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5),
axis.title.x = element_blank()) +
ylab("Count") +
xlab("Region (most common -> least common)") +
ggtitle("Count of Interns by Region")
p2 <- ggplot(intern, aes(x=reorder(Region,Region,function(x)-length(x)),
fill=Intern.gender)) +
geom_bar(position = "fill",stat = "count") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5),
legend.position="top") +
xlab("Region (most common -> least common)") +
ylab("Proportion") +
ggtitle("Gender of Interns by Region")
p3 <- ggplot(intern,
aes(x=reorder(Region,Region,function(x)-length(x)), y=rate_hour)) +
geom_boxplot() +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
xlab("Region (most common -> least common)") +
ylab("Rate per hour") +
ggtitle("Intern rates per hour by Region")
grid.arrange(p1, p2, p3, nrow = 3)
Region is shown to be not significant.
lm.Region <- lm(rate_hour ~ Region, intern)
summary(lm.Region)
Very little difference in the median hourly rate for the top 4 institutions. But the proportion of males is higher at Victoria University.
p1 <- ggplot(intern, aes(x=reorder(Institution,Institution,function(x)-length(x)))) +
geom_bar(stat="count") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
xlab("Institution (most common -> least common)") +
ylab("Count") +
ggtitle("Count of Interns by Institution")
p2 <- ggplot(intern, aes(x=reorder(Institution,Institution,function(x)-length(x)),
fill=Intern.gender)) +
geom_bar(position = "fill",stat = "count") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5),
legend.position="top") +
xlab("Institution (most common -> least common)") +
ylab("Proportion") +
ggtitle("Gender of Interns by Institution")
p3 <- ggplot(intern,
aes(x=reorder(Institution,Institution,function(x)-length(x)), y=rate_hour)) +
geom_boxplot() +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
xlab("Institution (most common -> least common)") +
ylab("Rate per hour") +
ggtitle("Intern rates per hour by Institution")
grid.arrange(p1, p2, p3, nrow = 3)
Coloured by gender, jitter has been added.
ggplot(intern, aes(x=reorder(Institution,Institution,function(x)-length(x)), y=rate_hour, color=Intern.gender)) +
geom_point(position = position_jitter(w = 0.05, h = 0.1), shape=1, size=2, stroke=0.8) + # Draw points
scale_color_manual(values=c("#D55E00", "#0072B2", "#F0E442", "#CC79A7")) +
geom_segment(aes(x=reorder(Institution,Institution,function(x)-length(x)),
xend=reorder(Institution,Institution,function(x)-length(x)),
y=min(rate_hour),
yend=max(rate_hour)),
linetype="dashed",
colour="black",
size=0.1) + # Draw dashed lines
labs(title="Dot Plot",
subtitle="Institution Vs Rate by Hour") +
xlab("Institution (most common -> least common)") +
theme_classic() +
coord_flip()
Institution is shown to be not significant.
lm.Institution <- lm(rate_hour ~ Institution, intern)
summary(lm.Institution)
No difference in median rate by gender for rate type, there is a larger spread in hourly rates.
p1 <- ggplot(intern, aes(x=reorder(Rate.period,Rate.period,function(x)-length(x)))) +
geom_bar(stat="count") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
xlab("Institution (most common -> least common)") +
ylab("Count") +
ggtitle("Count of Interns by Rate Period Type")
p2 <- ggplot(intern, aes(x=reorder(Rate.period,Rate.period,function(x)-length(x)),
fill=Intern.gender)) +
geom_bar(position = "fill",stat = "count") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5),
legend.position="top") +
xlab("Institution (most common -> least common)") +
ylab("Proportion") +
ggtitle("Gender of Interns by Rate Period Type")
p3 <- ggplot(intern,
aes(x=reorder(Rate.period,Rate.period,function(x)-length(x)), y=rate_hour)) +
geom_boxplot() +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
xlab("Institution (most common -> least common)") +
ylab("Rate per hour") +
ggtitle("Intern rates per hour by Rate Period Type")
grid.arrange(p1, p2, p3, nrow = 3)
Third year is the most common, but the median rates are similar.
p1 <- ggplot(intern, aes(x=as.factor(Year.of.study))) +
geom_bar(stat="count") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
xlab("Year of Study") +
ylab("Count") +
ggtitle("Count of Interns by Rate Period Type")
p2 <- ggplot(intern, aes(x=as.factor(Year.of.study), fill=Intern.gender)) +
geom_bar(position = "fill",stat = "count") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5),
legend.position="top") +
xlab("Year of Study") +
ylab("Proportion") +
ggtitle("Gender of Interns by Year of Study")
p3 <- ggplot(intern, aes(x=as.factor(Year.of.study), y=rate_hour)) +
geom_boxplot() +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
xlab("Year of Study") +
ylab("Rate per hour") +
ggtitle("Intern rates per hour by Year of Study")
grid.arrange(p1, p2, p3, nrow = 3)
Most offers happen quickly, there are three offers up to 6 months before the range of this graph.
intern$date <- as.Date(intern$Offer.Created.at)
# plot
ggplot(intern, aes(x=date)) +
geom_point(aes(y=rate_hour, col=Intern.gender)) +
labs(title="Time Series of Offer Date v rate",
y="Hourly Rate") + # title and caption
xlim(as.Date(c('2017-09-25','2017-12-25'))) +
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8), # rotate x axis text
panel.grid.minor = element_blank(), legend.position="top") # turn off minor grid
Summer of Tech is the most common programme for interns. Males are a higher proportion of interns for Summer of Tech, females for Summer of Biz. Median Hourly rates are higher for Summer of Tech over Summer of Biz.
p1 <- ggplot(intern, aes(x=reorder(Programme,Programme,function(x)-length(x)))) +
geom_bar(stat="count") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
xlab("Institution (most common -> least common)") +
ylab("Count") +
ggtitle("Count of Interns by Programme")
p2 <- ggplot(intern, aes(x=reorder(Programme,Programme,function(x)-length(x)),
fill=Intern.gender)) +
geom_bar(position = "fill",stat = "count") +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5),
legend.position="top") +
xlab("Institution (most common -> least common)") +
ylab("Proportion") +
ggtitle("Gender of Interns by Programme")
p3 <- ggplot(intern,
aes(x=reorder(Programme,Programme,function(x)-length(x)), y=rate_hour)) +
geom_boxplot() +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
xlab("Institution (most common -> least common)") +
ylab("Rate per hour") +
ggtitle("Intern rates per hour by Programme")
grid.arrange(p1, p2, p3, nrow = 3)
Coloured by gender, jitter has been added.
ggplot(intern, aes(x=reorder(Programme,Programme,function(x)-length(x)), y=rate_hour, color=Intern.gender)) +
geom_point(position = position_jitter(w = 0.05, h = 0.1), shape=1, size=2, stroke=0.8) + # Draw points
scale_color_manual(values=c("#D55E00", "#0072B2", "#F0E442", "#CC79A7")) +
geom_segment(aes(x=reorder(Programme,Programme,function(x)-length(x)),
xend=reorder(Programme,Programme,function(x)-length(x)),
y=min(rate_hour),
yend=max(rate_hour)),
linetype="dashed",
colour="black",
size=0.1) + # Draw dashed lines
labs(title="Dot Plot",
subtitle="Programme Vs Rate by Hour") +
xlab("Programme (most common -> least common)") +
ylab("Rate per hour") +
theme_classic() +
coord_flip()
The largest category is Victoria University and software developer. Victoria University is well represented in other internship types and software developer well represented in other institution types.
intern.i.it <- intern %>%
select(internship.type, Institution) %>%
group_by(internship.type, Institution) %>%
summarise(count = n())
p <- ggplot(intern.i.it, aes(x = internship.type, y = Institution))
p + geom_point( aes(size=count)) +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
xlab("Internship Type") +
ylab("Institution") +
ggtitle("Internship Type v Institution")
![Uploading BubblePlotInternInst.png…]()
Summer of Biz and Sot Jobs Biz only provided hr and marketing internships.
intern.p.it <- intern %>%
select(internship.type, Programme) %>%
group_by(internship.type, Programme) %>%
summarise(n = n())
p <- ggplot(intern.p.it, aes(x = internship.type, y = Programme))
p + geom_point( aes(size=n)) +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
Summer of Tech is well represented in all institutions.
intern.p.it <- intern %>%
select(Institution, Programme) %>%
group_by(Institution, Programme) %>%
summarise(n = n())
p <- ggplot(intern.p.it, aes(x = Institution, y = Programme))
p + geom_point( aes(size=n)) +
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
Cannot see very well as too many categories. Scree-plot indicates 4 dimensions (not shown here). Also p-value is high. This is not shown here.
Abbreviate the instituion types.
VU = Victoria University of Wellington
AU = Auckland University
MUW = Massey University (Wellington)
UC = University of Canterbury
W = Whitireia New Zealand
WIT = Wellington Institute of Technology (Weltec)
Everything else is Other
The p-value is now highly significant. Screeplot indicates 2 dimensions.
Wellington Institute of Technology (Weltec) closest aligned to security and hr. Massey University (Wellington) with design. Everthing else is lumped together, perhaps to align Auckland University with data analyst.
intern$Institution.abbv <- revalue(intern$Institution,
c("Auckland Institute of Studies" = "AIS",
"Auckland University" = "AU",
"Auckland University of Technology" = "AUT",
"Computer Power Plus" = "CPP",
"Enspiral Dev Academy" = "EDA",
"Massey University (Wellington)" = "MUW",
"Media Design School" = "MDS",
"Toi Ohomai Institute of Technology" = "TOIT",
"Unitec Institute of Technology" = "UIT",
"University of Canterbury" = "UC",
"University of Otago" = "UO",
"University of Waikato" = "UW",
"Victoria University of Wellington" = "VU",
"Waikato University" = "WU",
"Wellington ICT Graduate School (WICTGS)" = "WIGS",
"Wellington Institute of Technology (Weltec)" = "WIT",
"Whitireia New Zealand" = "W" ,
"Yoobee School of Design" = "Y"))
intern$internship.type.abbv <- revalue(intern$internship.type,
c("Business Analyst" = "BA",
"data analyst" = "DA",
"design" = "D",
"DevOps" = "DO",
"electronics" = "E",
"hr" = "H",
"information management" = "IM",
"infrastructure" = "I",
"IT consultant" = "ITC",
"IT support" = "ITS",
"marketing" = "M",
"mobile developer" = "MD",
"product manager" = "PDM",
"Project Manager" = "PJM",
"security" = "S",
"software developer" = "SD",
"systems engineer" = "SE",
"testing" = "T",
"VR/AR developer" = "VR",
"web developer" = "WD"))
intern.i.it <- intern %>%
select(Institution.abbv, internship.type) %>%
mutate(Institution.ch.abbv = as.character(Institution.abbv),
internship.type.ch = as.character(internship.type)) %>%
mutate(Institution = ifelse(Institution.ch.abbv %in% c("VU",
"AU",
"MUW",
"UC",
"W",
"WIT"),
Institution.ch.abbv,
"Other"),
internship.type = ifelse(internship.type.ch %in% c("software developer",
"design",
"marketing",
"IT consultant",
"hr",
"mobile developer",
"security",
"data analyst"),
internship.type.ch,
"Other")) %>%
mutate(Institution = as.factor(Institution),
internship.type = as.factor(internship.type))
table.i.it <- with(intern.i.it, table(internship.type,Institution))
intern.ca <- CA(table.i.it, graph = FALSE)
print(intern.ca)
fviz_screeplot(intern.ca, addlabels = TRUE)
fviz_ca_biplot(intern.ca, repel = TRUE, title="Correspondence Analysis: Internship Type v Institution for Successful Interns")
This is also highly significant. Dimension 1 accounts for over 90% of variation, the second is much smaller at around 6%. Summer of Biz and Sot Jobs (Biz) are closely aligned with hr and marketing. The screeplot indicates 1 dimension.
intern.i.it <- intern %>%
select(Programme, internship.type) %>%
mutate(Programme = as.character(Programme),
internship.type = as.character(internship.type)) %>%
mutate(internship.type = ifelse(internship.type %in% c("software developer",
"design",
"marketing",
"IT consultant",
"hr",
"mobile developer",
"security",
"data analyst"),
internship.type,
"Other")) %>%
mutate(Institution = as.factor(Programme),
internship.type = as.factor(internship.type))
table.p.it <- with(intern.i.it, table(internship.type,Programme))
intern.ca <- CA(table.p.it, graph = FALSE)
print(intern.ca)
fviz_screeplot(intern.ca, addlabels = TRUE)
fviz_ca_biplot(intern.ca, repel = TRUE, title="Correspondence Analysis: Internship Type v Programme for Successful Interns")
Significant at 0.05. Screeplot probably recommends 3 dimensions.
intern.p.i <- intern %>%
select(Institution.abbv, Programme) %>%
mutate(Institution.ch.abbv = as.character(Institution.abbv)) %>%
mutate(Institution = ifelse(Institution.ch.abbv %in% c("VU",
"AU",
"MUW",
"UC",
"W",
"WIT"),
Institution.ch.abbv,
"Other")) %>%
mutate(Institution = as.factor(Institution))
table.p.i <- with(intern.p.i, table(Institution,Programme))
intern.ca <- CA(table.p.i, graph = FALSE)
print(intern.ca)
fviz_screeplot(intern.ca, addlabels = TRUE)
fviz_ca_biplot(intern.ca, repel = TRUE, title="Correspondence Analysis: Internship Type v Programme for Successful Interns")
Thanks for contributing. However the purpose of this repo is to share dataviz for others to learn and share EDA type dataviz, and to help SoT with their questions. This detailed statistical analysis may be quite detailed for this initial look at the data.
I see you are anonymous, please could you add your name to your GitHub page?
I have added my name now. I can take this down if you feel it is not suitable.
Thanks, we will leave it here absolutely as all contributions are useful as everyone looks at data differently and has a different perspective. So I should have said that up front.
That it was detailed was just my feedback.
On another note I see you are using R Markdown, did using the R notebook work ?
Cool, comments taken on board, feedback is one of the most important parts of writing code.
Regarding RMarkdown, I upgraded to R 3.5 which fixed my issues, thanks.
One way to initially visualise the data of the individual files. Plot univariate bar plots of the categorical variables using the plot_bar function from DataExplorer R Package, ignoring variables with more than 18 categories for now.
These two variables with very large category numbers are ignored for now: token: 214 categories Offer Created at: 198 categories
From these bar plots, it appears that for December 2017: