UBC-MDS / Absenteeism_at_Work

MIT License

Absenteeism Hours Predictor

About

In this project, we built three machine learning regression models (a random forest regressor, a support vector machine regressor with a linear kernel, and a ridge regressor) to predict absenteeism time in hours from the “Absenteeism at work” dataset. Our final model, the support vector machine regressor with a linear kernel, performed reasonably well on an unseen test data set, with a negative RMSE score of -5.966. In other words, across the 222 test cases, the root mean squared error of our predictions is 5.966 hours. However, on both the training and test data sets, our predictor tends to over-predict when the actual absenteeism hours are low and under-predict when they are high. Since our predictions may affect the decisions and judgements that an employer makes when dealing with absenteeism among employees, we suggest that more sophisticated machine learning algorithms and feature selection approaches be explored to improve the model before it is used to guide decisions about absenteeism in the workplace.
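As a rough illustration of the metric quoted above, the sketch below computes an RMSE and its sign-flipped counterpart. The sign flip is shown on the assumption that the score was produced by a "higher is better" scoring convention (scikit-learn, for example, offers a `neg_root_mean_squared_error` scorer); the hour values are hypothetical and are not taken from the project's data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical absenteeism hours, chosen only for illustration.
actual_hours = [2, 8, 4, 40]
predicted_hours = [4, 6, 5, 20]

error = rmse(actual_hours, predicted_hours)

# Sign-flipped score, so that a larger value means a better model.
neg_rmse = -error

# Positive residual = over-prediction, negative residual = under-prediction.
residuals = [p - a for a, p in zip(actual_hours, predicted_hours)]

print(f"RMSE: {error:.3f}, negative RMSE: {neg_rmse:.3f}")
print("Residuals:", residuals)
```

In this toy example the low actual values have positive residuals and the high actual value has a large negative residual, which is the same pattern of over- and under-prediction described above.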

Our data set is the “Absenteeism at work Data Set” from the UCI Machine Learning Repository. The data set can be found here; it was created by Andrea Martiniano, Ricardo Pinto Ferreira, and Renato Jose Sassi from the Postgraduate Program in Informatics and Knowledge Management at Nove de Julho University, Rua Vergueiro (Andrea Martiniano, Ricardo Pinto Ferreira, and Renato Jose Sassi 2010). The data was collected at a courier company in Brazil. The database includes the monthly absenteeism records of 36 different workers over three years, starting from July 2007, together with how changes in their circumstances affect their absence rate over time. This data set contains 740 instances with 21 attributes, including 8 categorical and 9 numerical features (excluding the target, Absenteeism time in hours, and the dropped features ID, Disciplinary failure, Body mass index, Service time, and Month of absence). Each row represents information about an employee's absence, along with family, workload, and other factors that might be related to that absence. Of the considered attributes, the absenteeism time in hours is the target to predict from the remaining features.
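A minimal sketch of the column split described above, assuming the data is read from a semicolon-separated file with a header row. The target and dropped column names come from the text; the inline sample rows, the `Age` column, and the helper `split_features_target` are hypothetical and only stand in for the real file.

```python
import csv
import io

# Column names as given in the description of the data set.
TARGET = "Absenteeism time in hours"
DROPPED = {
    "ID",
    "Disciplinary failure",
    "Body mass index",
    "Service time",
    "Month of absence",
}

# Hypothetical inline sample with only a handful of columns.
sample = io.StringIO(
    "ID;Month of absence;Age;Service time;Absenteeism time in hours\n"
    "11;7;33;13;4\n"
    "36;7;50;18;0\n"
)


def split_features_target(rows):
    """Separate the target from the features and discard the dropped columns."""
    features, target = [], []
    for row in rows:
        target.append(float(row[TARGET]))
        features.append(
            {k: v for k, v in row.items() if k != TARGET and k not in DROPPED}
        )
    return features, target


X, y = split_features_target(csv.DictReader(sample, delimiter=";"))
print(X)
print(y)
```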

Report

The final report can be found here

Usage

There are two suggested ways to run this analysis:

1. Using Docker

Note: the instructions in this section also depend on running the commands in a Unix shell (e.g., Terminal or Git Bash).

To replicate the analysis, install Docker. Then clone this GitHub repository and run the following command at the command line/terminal from the root directory of this project:

docker run --rm -v /$(pwd):/home/rstudio/project yikisu/absenteeism_project:latest make directory=/home/rstudio/project all

To reset the repo to a clean state, with no intermediate or results files, run the following command at the command line/terminal from the root directory of this project:

docker run --rm -v /$(pwd):/home/rstudio/project yikisu/absenteeism_project:latest make directory=/home/rstudio/project clean

2. Without using Docker

To replicate the analysis, clone this GitHub repository, install the dependencies listed below, and run the following command at the command line/terminal from the root directory of this project:

make all

To reset the repo to a clean state, with no intermediate or results files, run the following command at the command line/terminal from the root directory of this project:

make clean

Dependency Diagram of the Makefile

Dependencies

References

Andrea Martiniano, Ricardo Pinto Ferreira, and Renato Jose Sassi. 2010. “UCI: Machine Learning Repository.” Universidade Nove de Julho - Postgraduate Program in Informatics and Knowledge Management.