R-LadiesAKL / sotdata

Data for Summer of Tech dataviz meetup 21 June 2018
0 stars 1 forks source link

Initial analysis and visualisation of individual files - Intern #2

Open kimnewzealand opened 6 years ago

kimnewzealand commented 6 years ago

One way to initially visualise the data of the individual files. Plot univariate bar plots of the categorical variables using the plot_bar function from DataExplorer R Package, ignoring variables with more than 18 categories for now.

# Convert the year of study to character as this is a categorical variable
intern$`Year of study` <- as.character(intern$`Year of study`)
# Plot the intern data setting the maximum number of categories to 17 to view on screen
DataExplorer::plot_bar(intern,maxcat=17)
# Plot the internship type with  21 categories to fit to screen
DataExplorer::plot_bar(intern$`internship type`)
# Plot the Institution with 19 categories separately to fit to screen
DataExplorer::plot_bar(intern$Institution)

These two variables with very large category numbers are ignored for now: token: 214 categories Offer Created at: 198 categories

intern_plot_bar1 intern_plot_bar2 intern_plot_bar3

From these bar plots, it appears that for December 2017:

cabbagetreecustard commented 6 years ago

Investigation of the intern Summer of Code dataset.

This RMD is a visual investigation of the intern dataset. The intern dataset has 214 observations of 13 variables. There are two main areas of investigation:

Effect of independant variable on hourly rate

Equivalent hourly rate (in the case of yearly rates calculated using rate/(52*40)) compared to Intern Gender, Programme, Region, Internship Type, Intern Ethnicity, Institution and Offer Created at. The relationship between hourly rate, gender and other variables is investigated. An interesting question is, do female interns receive an hourly rate less than male interns? Within this dataset it seems the only variable which has an effect on rate per hour is the internship type.

Correspondence Analysis for Categorical Variables

Investigates relationships between Internship Type, Programme and Institution. Have used the FactoMineR package, a good tutorial is here http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/113-ca-correspondence-analysis-in-r-essentials/

These are the libraries used in this file.

library(plyr)
library(knitr)
library(tidyverse)
library(FactoMineR)
library("factoextra")
library(gridExtra)
library(grid)
library(scales)

Only the intern dataset is imported for this RMD file.

intern <- read.csv("sotdata-master\\sotdata-master\\Intern data Dec 2017.csv")

Calculate a hourly equivalent rate if the rate is of type year.

intern$rate_hour <- ifelse(intern$Rate.period=="hour", intern$Rate.value, intern$Rate.value/(52*40)) 

Rate by Gender Boxplot

Median rate for male and female is similar, but there is more spread for female.

# Density plots with semi-transparent fill
intern.mf <- subset(intern, Intern.gender == "female" | Intern.gender == "male")

ggplot(intern.mf, aes(x=Intern.gender, y=rate_hour)) +   
  geom_boxplot() +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
  xlab("Gender")

boxplot plot of gender v rate per hour

Rate by Gender Density

This is a denisty plot so based on proportion. It seems that there are more females with low hourly rates.

ggplot(intern.mf, aes(x=rate_hour, fill=Intern.gender)) + 
  geom_density(alpha=.3) 

density plot of gender v rate per hour

Rate by Gender Histogram

If we look at absolute values, the numbers of males and females at the lower rates are similar, but there are less females at higher hourly rates.

ggplot(intern.mf, aes(x=rate_hour, fill=Intern.gender)) + 
  geom_histogram(binwidth=2, alpha=.3, position="identity") 

histogram of gender v rate per hour

Significance of gender on hourly rate

Running a linear model on gender does not provide a significant result.

lm.gender <- lm(rate_hour ~ Intern.gender, intern.mf)
summary(lm.gender)

image

Interns by Ethnicity

The most common category is pakeha. The data may benefit from rolling up into NZ Statistics standard categories. Interestingly for those that gender is undisclosed, ethnicity is undisclosed as well. The median rates are similar for the top three categories.

p1 <- ggplot(intern, 
             aes(x=reorder(Intern.ethnicity,
                           Intern.ethnicity,
                           function(x)-length(x)))) +
  geom_bar(stat="count") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
  xlab("Ethnicity (most common -> least common)") +
  ylab("Count") +
  ggtitle("Count of Interns by Ethnicity") 

p2 <- ggplot(intern, 
             aes(x=reorder(Intern.ethnicity,Intern.ethnicity,function(x)-length(x)), 
                 fill=Intern.gender)) +
  geom_bar(position = "fill",stat = "count") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5), 
        legend.position="top") +  
  xlab("Ethnicity (most common -> least common)") +
  ylab("Proportion") +
  ggtitle("Gender of Interns v Ethnicity") 

p3 <- ggplot(intern, 
             aes(x=reorder(Intern.ethnicity,Intern.ethnicity,function(x)-length(x)), y=rate_hour)) +   
  geom_boxplot() +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
  xlab("Ethnicity (most common -> least common)") +
  ylab("Rate per hour") +
  ggtitle("Intern rates per hour by Ethnicity")

grid.arrange(p1, p2, p3, nrow = 3)

ethnicity

Dotplot of ethnicity v hourly rate

Coloured by gender, jitter has been added.

ggplot(intern, aes(x=reorder(Intern.ethnicity,Intern.ethnicity,function(x)-length(x)), y=rate_hour, color=Intern.gender)) + 
  geom_point(position = position_jitter(w = 0.05, h = 0.1), shape=1, size=2, stroke=0.8) +   # Draw points
  scale_color_manual(values=c("#D55E00", "#0072B2", "#F0E442", "#CC79A7")) +
  geom_segment(aes(x=reorder(Intern.ethnicity,Intern.ethnicity,function(x)-length(x)), 
                   xend=reorder(Intern.ethnicity,Intern.ethnicity,function(x)-length(x)), 
                   y=min(rate_hour), 
                   yend=max(rate_hour)), 
               linetype="dashed", 
               colour="black",
               size=0.1) +   # Draw dashed lines
  labs(title="Dot Plot", 
       subtitle="Intern Ethnicity Vs Rate by Hour") + 
  xlab("Intern Ethnicity (most common -> least common)") + 
  ylab("Rate per hour") +
  theme_classic() +
  coord_flip()

ethnicitydotplot

Significance of ethnicity on hourly rate

Running a linear model on ethnicity provides a marginal result.

lm.ethnicity <- lm(rate_hour ~ Intern.ethnicity, intern)
summary(lm.ethnicity)

image

Intern Type

By far the most common internship type is software developer and of those males were more common than females.

p1 <- ggplot(intern, 
             aes(x=reorder(internship.type,internship.type,function(x)-length(x)))) +
  geom_bar(stat="count") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
  xlab("Internship Type (most common -> least common)") +
  ylab("Count") +
  ggtitle("Count of Interns by Internship Type") 

p2 <- ggplot(intern, 
             aes(x=reorder(internship.type,internship.type,function(x)-length(x)), 
             fill=Intern.gender)) +
  geom_bar(position = "fill",stat = "count") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5), 
        legend.position="top") +  
  xlab("Internship Type (most common -> least common)") +
  ylab("Proportion") +
  ggtitle("Gender of Interns by Internship Type") 

p3 <- ggplot(intern, 
             aes(x=reorder(internship.type,internship.type,function(x)-length(x)), y=rate_hour)) +   
  geom_boxplot() +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
  xlab("Internship Type (most common -> least common)") +
  ylab("Rate per hour") +
  ggtitle("Intern rates per hour by Internship type")

grid.arrange(p1, p2, p3, nrow = 3)

Boxplot of reduced internship type and gender for hourly rate

Because there are too many categories a simplified variable for internship type has been created. When controlling for internship type gender seems to have no effect on median rate.

intern <- intern %>%
  mutate(internship.type.ch = as.character(internship.type)) %>%
  mutate(internship.type.simplified = ifelse(internship.type.ch %in% c("software developer",
                                                         "design",
                                                         "marketing",
                                                         "IT consultant",
                                                         "hr",
                                                         "mobile developer",
                                                         "security",
                                                         "data analyst"),
                                  internship.type.ch,
                                  "Other")) %>%
  mutate(internship.type.simplified = as.factor(internship.type.simplified)) %>%
  select(-internship.type.ch)

ggplot(intern, 
             aes(x=reorder(internship.type.simplified,internship.type.simplified,function(x)-length(x)), y=rate_hour, fill=Intern.gender)) +   
  geom_boxplot() +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
  xlab("Internship Type (most common -> least common)") +
  ylab("Rate per hour") +
  ggtitle("Boxplot of Intern rates per hour by Internship type and Gender")

Dotplot of internship type v hourly rate

Coloured by gender, jitter has been added.

ggplot(intern, aes(x=reorder(internship.type,internship.type,function(x)-length(x)), y=rate_hour, color=Intern.gender)) + 
  geom_point(position = position_jitter(w = 0.05, h = 0.1), shape=1, size=2, stroke = 0.8) +   # Draw points
  scale_color_manual(values=c("#D55E00", "#0072B2", "#F0E442", "#CC79A7")) +
  geom_segment(aes(x=reorder(internship.type,internship.type,function(x)-length(x)), 
                   xend=reorder(internship.type,internship.type,function(x)-length(x)), 
                   y=min(rate_hour), 
                   yend=max(rate_hour)), 
               linetype="dashed", 
               colour="black",
               size=0.1) +   # Draw dashed lines
  labs(title="Dot Plot", 
       subtitle="Internship Type Vs Rate by Hour") + 
  xlab("Internship Type (most common -> least common)") + 
  theme_classic() +
  coord_flip()

Significance of Internship type on hourly rate

Running a linear model on internship type provides a significant result, but R-square is low at 16%.

lm.internship.type <- lm(rate_hour ~ internship.type, intern)
summary(lm.internship.type)

image

Does gender have an effect on hourly rate once internship type taken into account

No it does not. An anova was used comparing a linear model of simplified intern type as the independent variable and rate as the dependant variable against a more complex model of simplified intern type and gender as the independent variables and rate as the dependant variable.

lm.intern.type.gender <- lm(rate_hour ~ internship.type.simplified + Intern.gender, intern.mf)
anova(lm.intern.type.s, lm.intern.type.gender)

image

Region

Most interns are from Wellington and Auckland with similar rates of gender. Median hourly rates are similar between Auckland and Wellington but Wellington has longer right hand tail.

p1 <- ggplot(intern, aes(x=reorder(Region,Region,function(x)-length(x)))) +
  geom_bar(stat="count") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5), 
        axis.title.x = element_blank()) + 
  ylab("Count") +
  xlab("Region (most common -> least common)") +
  ggtitle("Count of Interns by Region") 

p2 <- ggplot(intern, aes(x=reorder(Region,Region,function(x)-length(x)), 
                         fill=Intern.gender)) +
  geom_bar(position = "fill",stat = "count") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5), 
        legend.position="top") + 
  xlab("Region (most common -> least common)") +
  ylab("Proportion") +
  ggtitle("Gender of Interns by Region") 

p3 <- ggplot(intern, 
             aes(x=reorder(Region,Region,function(x)-length(x)), y=rate_hour)) +   
  geom_boxplot() +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
  xlab("Region (most common -> least common)") +
  ylab("Rate per hour") +
  ggtitle("Intern rates per hour by Region")

grid.arrange(p1, p2, p3, nrow = 3)

region

Significance of Region on hourly rate

Region is shown to be not significant.

lm.Region <- lm(rate_hour ~ Region, intern)
summary(lm.Region)

Institution

Very little difference in the median hourly rate for the top 4 institutions. But the proportion of males is higher at Victoria University.

p1 <- ggplot(intern, aes(x=reorder(Institution,Institution,function(x)-length(x)))) +
  geom_bar(stat="count") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
  xlab("Institution (most common -> least common)") +
  ylab("Count") +
  ggtitle("Count of Interns by Institution") 

p2 <- ggplot(intern, aes(x=reorder(Institution,Institution,function(x)-length(x)), 
                         fill=Intern.gender)) +
  geom_bar(position = "fill",stat = "count") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5), 
        legend.position="top") + 
  xlab("Institution (most common -> least common)") +
  ylab("Proportion") +
  ggtitle("Gender of Interns by Institution") 

p3 <- ggplot(intern, 
             aes(x=reorder(Institution,Institution,function(x)-length(x)), y=rate_hour)) +   
  geom_boxplot() +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
  xlab("Institution (most common -> least common)") +
  ylab("Rate per hour") +
  ggtitle("Intern rates per hour by Institution")

grid.arrange(p1, p2, p3, nrow = 3)

institution

Dotplot of Institution v hourly rate

Coloured by gender, jitter has been added.

ggplot(intern, aes(x=reorder(Institution,Institution,function(x)-length(x)), y=rate_hour, color=Intern.gender)) + 
  geom_point(position = position_jitter(w = 0.05, h = 0.1), shape=1, size=2, stroke=0.8) +   # Draw points
  scale_color_manual(values=c("#D55E00", "#0072B2", "#F0E442", "#CC79A7")) +
  geom_segment(aes(x=reorder(Institution,Institution,function(x)-length(x)), 
                   xend=reorder(Institution,Institution,function(x)-length(x)), 
                   y=min(rate_hour), 
                   yend=max(rate_hour)), 
               linetype="dashed", 
               colour="black",
               size=0.1) +   # Draw dashed lines
  labs(title="Dot Plot", 
       subtitle="Institution Vs Rate by Hour") + 
  xlab("Institution (most common -> least common)") + 
  theme_classic() +
  coord_flip()

institutiondotplot

Significance of Institution on hourly rate

Institution is shown to be not significant.

lm.Institution <- lm(rate_hour ~ Institution, intern)
summary(lm.Institution)

image

Rate.Period

No difference in median rate by gender for rate type, there is a larger spread in hourly rates.

p1 <- ggplot(intern, aes(x=reorder(Rate.period,Rate.period,function(x)-length(x)))) +
  geom_bar(stat="count") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
  xlab("Institution (most common -> least common)") +
  ylab("Count") +
  ggtitle("Count of Interns by Rate Period Type") 

p2 <- ggplot(intern, aes(x=reorder(Rate.period,Rate.period,function(x)-length(x)), 
                         fill=Intern.gender)) +
  geom_bar(position = "fill",stat = "count") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5), 
        legend.position="top") + 
  xlab("Institution (most common -> least common)") +
  ylab("Proportion") +
  ggtitle("Gender of Interns by Rate Period Type") 

p3 <- ggplot(intern, 
             aes(x=reorder(Rate.period,Rate.period,function(x)-length(x)), y=rate_hour)) +   
  geom_boxplot() +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
  xlab("Institution (most common -> least common)") +
  ylab("Rate per hour") +
  ggtitle("Intern rates per hour by Rate Period Type")

grid.arrange(p1, p2, p3, nrow = 3)

rateperiod

Rate by Year of Study

Third year is the most common, but the median rates are similar.

p1 <- ggplot(intern, aes(x=as.factor(Year.of.study))) +
  geom_bar(stat="count") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
  xlab("Year of Study") +
  ylab("Count") +
  ggtitle("Count of Interns by Rate Period Type") 

p2 <- ggplot(intern, aes(x=as.factor(Year.of.study), fill=Intern.gender)) +
  geom_bar(position = "fill",stat = "count") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5), 
        legend.position="top") + 
  xlab("Year of Study") +
  ylab("Proportion") +
  ggtitle("Gender of Interns by Year of Study") 

p3 <- ggplot(intern, aes(x=as.factor(Year.of.study), y=rate_hour)) +   
  geom_boxplot() +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
  xlab("Year of Study") +
  ylab("Rate per hour") +
  ggtitle("Intern rates per hour by Year of Study")

grid.arrange(p1, p2, p3, nrow = 3)

rateyearstudy

Rate by Offer Date

Most offers happen quickly, there are three offers up to 6 months before the range of this graph.

intern$date <- as.Date(intern$Offer.Created.at)

# plot
ggplot(intern, aes(x=date)) + 
  geom_point(aes(y=rate_hour, col=Intern.gender)) + 
  labs(title="Time Series of Offer Date v rate",  
       y="Hourly Rate") +  # title and caption
  xlim(as.Date(c('2017-09-25','2017-12-25'))) +
  theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8),  # rotate x axis text
        panel.grid.minor = element_blank(), legend.position="top")  # turn off minor grid

ratecreatedat

Rate by Programme

Summer of Tech is the most common programme for interns. Males are a higher proportion of interns for Summer of Tech, females for Summer of Biz. Median Hourly rates are higher for Summer of Tech over Summer of Biz.

p1 <- ggplot(intern, aes(x=reorder(Programme,Programme,function(x)-length(x)))) +
  geom_bar(stat="count") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
  xlab("Institution (most common -> least common)") +
  ylab("Count") +
  ggtitle("Count of Interns by Programme") 

p2 <- ggplot(intern, aes(x=reorder(Programme,Programme,function(x)-length(x)), 
                         fill=Intern.gender)) +
  geom_bar(position = "fill",stat = "count") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5), 
        legend.position="top") + 
  xlab("Institution (most common -> least common)") +
  ylab("Proportion") +
  ggtitle("Gender of Interns by Programme") 

p3 <- ggplot(intern, 
             aes(x=reorder(Programme,Programme,function(x)-length(x)), y=rate_hour)) +   
  geom_boxplot() +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + 
  xlab("Institution (most common -> least common)") +
  ylab("Rate per hour") +
  ggtitle("Intern rates per hour by Programme")

grid.arrange(p1, p2, p3, nrow = 3)

programme

Dotplot of Programme v hourly rate

Coloured by gender, jitter has been added.

ggplot(intern, aes(x=reorder(Programme,Programme,function(x)-length(x)), y=rate_hour, color=Intern.gender)) + 
  geom_point(position = position_jitter(w = 0.05, h = 0.1), shape=1, size=2, stroke=0.8) +   # Draw points
  scale_color_manual(values=c("#D55E00", "#0072B2", "#F0E442", "#CC79A7")) +
  geom_segment(aes(x=reorder(Programme,Programme,function(x)-length(x)), 
                   xend=reorder(Programme,Programme,function(x)-length(x)), 
                   y=min(rate_hour), 
                   yend=max(rate_hour)), 
               linetype="dashed", 
               colour="black",
               size=0.1) +   # Draw dashed lines
  labs(title="Dot Plot", 
       subtitle="Programme Vs Rate by Hour") + 
  xlab("Programme (most common -> least common)") + 
  ylab("Rate per hour") +
  theme_classic() +
  coord_flip()

dotplotprogramme

Bubble plot of internship type v Institution

The largest category is Victoria University and software developer. Victoria University is well represented in other internship types and software developer well represented in other institution types.

intern.i.it <- intern %>% 
  select(internship.type, Institution) %>%
  group_by(internship.type, Institution) %>%
  summarise(count = n())

p <- ggplot(intern.i.it, aes(x = internship.type, y = Institution)) 
p +  geom_point( aes(size=count)) + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
  xlab("Internship Type") +
  ylab("Institution") +
  ggtitle("Internship Type v Institution")

![Uploading BubblePlotInternInst.png…]()

Bubble plot of internship type v Programme

Summer of Biz and Sot Jobs Biz only provided hr and marketing internships.

intern.p.it <- intern %>% 
  select(internship.type, Programme) %>%
  group_by(internship.type, Programme) %>%
  summarise(n = n())

p <- ggplot(intern.p.it, aes(x = internship.type, y = Programme)) 
p +  geom_point( aes(size=n)) + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

bubbleplotinternprog

Bubble plot of Instituion v Programme

Summer of Tech is well represented in all institutions.

intern.p.it <- intern %>% 
  select(Institution, Programme) %>%
  group_by(Institution, Programme) %>%
  summarise(n = n())

p <- ggplot(intern.p.it, aes(x = Institution, y = Programme)) 
p +  geom_point( aes(size=n)) + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

bubbleplotproginst

Correspondence analysis between institution and intern type

Cannot see very well as too many categories. Scree-plot indicates 4 dimensions (not shown here). Also p-value is high. This is not shown here.

Abbreviate the instituion types. VU = Victoria University of Wellington
AU = Auckland University
MUW = Massey University (Wellington)
UC = University of Canterbury
W = Whitireia New Zealand
WIT = Wellington Institute of Technology (Weltec)
Everything else is Other
The p-value is now highly significant. Screeplot indicates 2 dimensions.

Wellington Institute of Technology (Weltec) closest aligned to security and hr. Massey University (Wellington) with design. Everthing else is lumped together, perhaps to align Auckland University with data analyst.

intern$Institution.abbv <- revalue(intern$Institution, 
        c("Auckland Institute of Studies" = "AIS",
          "Auckland University" = "AU",
           "Auckland University of Technology" = "AUT",
           "Computer Power Plus" = "CPP",
           "Enspiral Dev Academy" = "EDA",
           "Massey University (Wellington)" = "MUW",
           "Media Design School"  = "MDS",
           "Toi Ohomai Institute of Technology" = "TOIT",
           "Unitec Institute of Technology" = "UIT",
           "University of Canterbury" = "UC",
           "University of Otago" = "UO",
           "University of Waikato" = "UW",
           "Victoria University of Wellington" = "VU",
           "Waikato University" = "WU",
           "Wellington ICT Graduate School (WICTGS)" = "WIGS",
           "Wellington Institute of Technology (Weltec)" = "WIT",
           "Whitireia New Zealand"  = "W" ,
           "Yoobee School of Design" = "Y"))

intern$internship.type.abbv <- revalue(intern$internship.type,
        c("Business Analyst" = "BA",
        "data analyst" = "DA",
        "design" = "D",
        "DevOps" = "DO",
        "electronics" = "E",
        "hr" = "H",
        "information management" = "IM",
        "infrastructure" = "I",
        "IT consultant" = "ITC",
        "IT support" = "ITS",
        "marketing" = "M",
        "mobile developer" = "MD",
        "product manager" = "PDM",
        "Project Manager" = "PJM",
        "security" = "S",
        "software developer" = "SD",
        "systems engineer" = "SE",
        "testing" = "T",
        "VR/AR developer" = "VR",
        "web developer" = "WD"))

intern.i.it <- intern %>%
  select(Institution.abbv, internship.type) %>%
  mutate(Institution.ch.abbv = as.character(Institution.abbv), 
         internship.type.ch = as.character(internship.type)) %>%
  mutate(Institution = ifelse(Institution.ch.abbv %in% c("VU",
                                                "AU",
                                                "MUW",
                                                "UC",
                                                "W",
                                                "WIT"),
                             Institution.ch.abbv,
                             "Other"),
         internship.type = ifelse(internship.type.ch %in% c("software developer",
                                                         "design",
                                                         "marketing",
                                                         "IT consultant",
                                                         "hr",
                                                         "mobile developer",
                                                         "security",
                                                         "data analyst"),
                                  internship.type.ch,
                                  "Other")) %>%
  mutate(Institution = as.factor(Institution), 
         internship.type = as.factor(internship.type))

table.i.it <- with(intern.i.it, table(internship.type,Institution)) 
intern.ca <- CA(table.i.it, graph = FALSE)
print(intern.ca)

fviz_screeplot(intern.ca, addlabels = TRUE)
fviz_ca_biplot(intern.ca, repel = TRUE, title="Correspondence Analysis: Internship Type v Institution for Successful Interns")

ca_intern_institution

Correspondence analysis between Programme and simplified intern type

This is also highly significant. Dimension 1 accounts for over 90% of variation, the second is much smaller at around 6%. Summer of Biz and Sot Jobs (Biz) are closely aligned with hr and marketing. The screeplot indicates 1 dimension.

intern.i.it <- intern %>%
  select(Programme, internship.type) %>%
  mutate(Programme = as.character(Programme), 
         internship.type = as.character(internship.type)) %>%
  mutate(internship.type = ifelse(internship.type %in% c("software developer",
                                                         "design",
                                                         "marketing",
                                                         "IT consultant",
                                                         "hr",
                                                         "mobile developer",
                                                         "security",
                                                         "data analyst"),
                                  internship.type,
                                  "Other")) %>%
  mutate(Institution = as.factor(Programme), 
         internship.type = as.factor(internship.type))
table.p.it <- with(intern.i.it, table(internship.type,Programme)) 
intern.ca <- CA(table.p.it, graph = FALSE)
print(intern.ca)

fviz_screeplot(intern.ca, addlabels = TRUE)
fviz_ca_biplot(intern.ca, repel = TRUE, title="Correspondence Analysis: Internship Type v Programme for Successful Interns")

ca_intern_programme

Correspondence analysis between Programme and simplified Institution

Significant at 0.05. Screeplot probably recommends 3 dimensions.

intern.p.i <- intern %>%
  select(Institution.abbv, Programme) %>%
  mutate(Institution.ch.abbv = as.character(Institution.abbv)) %>%
  mutate(Institution = ifelse(Institution.ch.abbv %in% c("VU",
                                                "AU",
                                                "MUW",
                                                "UC",
                                                "W",
                                                "WIT"),
                             Institution.ch.abbv,
                             "Other")) %>%
  mutate(Institution = as.factor(Institution))
table.p.i <- with(intern.p.i, table(Institution,Programme)) 
intern.ca <- CA(table.p.i, graph = FALSE)
print(intern.ca)

fviz_screeplot(intern.ca, addlabels = TRUE)
fviz_ca_biplot(intern.ca, repel = TRUE, title="Correspondence Analysis: Internship Type v Programme for Successful Interns")

ca_inst_programme

kimnewzealand commented 6 years ago

Thanks for contributing. However the purpose of this repo is to share dataviz for others to learn and share EDA type dataviz, and to help SoT with their questions. This detailed statistical analysis may be quite detailed for this initial look at the data.

I see you are anonymous, please could you add your name to your GitHub page?

cabbagetreecustard commented 6 years ago

I have added my name now. I can take this down if you feel it is not suitable.

kimnewzealand commented 6 years ago

Thanks, we will leave it here absolutely as all contributions are useful as everyone looks at data differently and has a different perspective. So I should have said that up front.

That it was detailed was just my feedback.

On another note I see you are using R Markdown, did using the R notebook work ?

cabbagetreecustard commented 6 years ago

Cool, comments taken on board, feedback is one of the most important parts of writing code.

Regarding RMarkdown, I upgraded to R 3.5 which fixed my issues, thanks.