TSFelg / fairly

Fairly is a tool to help tech workers residing in Portugal know if they're being paid fairly.

Fairly is a tool to help tech workers living in Portugal know if they're being paid fairly. Fairly models the probability distribution of the annual gross salary conditioned on several input features, such as the job, years of experience, and the country where the company is located. The app then lets new users enter their characteristics and see where they fall within the conditional distribution. Fairly is being developed in the context of the Landing.Jobs Data Challenge, which aims to generate knowledge based on the Tech Careers Report 2021.

Stack

Data

The data used in this project comes from the Tech Careers Report 2021, which gathered more than 3000 answers from tech workers in Portugal. The report collected more than 100 variables, but not all of them are relevant for answering the question: "Are you paid fairly?". In addition, the deployed app should make it easy for users to get their answer, so it is important to find the right balance between data exhaustiveness and model performance. Given this, there are three main reasons why features were discarded:

After this selection, the final features used to train the model are: Working Experience, English Level, Residence District, Education Level, Company Country, Company Type, Employment Status, Job Role. The first two were ordinally encoded and the remaining six were one-hot encoded.
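The two encodings can be illustrated with a minimal, standard-library sketch. The category levels and the `encode_profile` helper below are hypothetical and do not come from the report or the project code; only three of the eight features are shown to keep the example short.

```python
# Hypothetical category levels; the real survey values may differ.
EXPERIENCE_LEVELS = ["0-1 years", "1-3 years", "3-6 years", "6-9 years", "9+ years"]
ENGLISH_LEVELS = ["Basic", "Intermediate", "Advanced", "Native"]
JOB_ROLES = ["Backend Developer", "Data Scientist", "CTO"]  # hypothetical subset


def ordinal_encode(value, ordered_levels):
    """Map an ordered category to its rank (0 = lowest level)."""
    return ordered_levels.index(value)


def one_hot_encode(value, categories):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


def encode_profile(profile):
    """Build a feature vector: ordinal features first, then one-hot features."""
    ordinal_part = [
        ordinal_encode(profile["experience"], EXPERIENCE_LEVELS),
        ordinal_encode(profile["english"], ENGLISH_LEVELS),
    ]
    return ordinal_part + one_hot_encode(profile["job_role"], JOB_ROLES)


example = {
    "experience": "3-6 years",
    "english": "Advanced",
    "job_role": "Data Scientist",
}
features = encode_profile(example)
print(features)  # [2, 2, 0, 1, 0]
```

Ordinal encoding keeps the natural order of experience and English levels in a single column, while one-hot encoding avoids implying an order among unordered categories such as job roles.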

Modelling

Fairly models the probability distribution of the gross annual salary conditioned on the input features. With a deterministic model we could only say whether a user is paid more or less than the average of the population with the same profile. Given the natural variance in the data, this wouldn't be very informative: you may be paid more or less because of variables we're not conditioning on, or simply because of the aleatoric uncertainty in the data. By modelling the full conditional distribution we enable questions such as: "Is my salary lower than 90% of the population with the same profile?". These are empowering questions that can help users know if they're paid fairly.
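As a rough illustration of how such a question could be answered, the sketch below assumes the conditional distribution for a profile is LogNormal (one of the families considered in this project) with known parameters. The function names and the parameter values are hypothetical; in the app these parameters would come from the trained model.

```python
import math
from statistics import NormalDist


def salary_percentile(salary, mu, sigma):
    """Fraction of the conditional LogNormal distribution below `salary`.

    `mu` and `sigma` are the mean and standard deviation of the log-salary
    for the user's profile.
    """
    return NormalDist(mu, sigma).cdf(math.log(salary))


def salary_at_quantile(q, mu, sigma):
    """Salary below which a fraction `q` of comparable profiles fall."""
    return math.exp(NormalDist(mu, sigma).inv_cdf(q))


# Hypothetical parameters: median salary of 30000 for this profile.
mu, sigma = math.log(30000), 0.35

p = salary_percentile(24000, mu, sigma)
print(f"{p:.0%} of comparable profiles earn less than 24000")
print(f"10th percentile salary: {salary_at_quantile(0.10, mu, sigma):.0f}")
```

A point prediction alone would only say that 24000 is below the expected salary; the percentile says how unusual that gap is for the profile.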

To model the conditional distribution we used ngboost. This model combines the high performance of gradient boosting algorithms with natural gradients to learn multi-parameter distributions. The distributions chosen for modelling were the LogNormal, Normal, and Laplace. For each of these distributions, the models were trained using grid search with cross-validation to explore the hyper-parameter space of the ngboost models. For each distribution, the best model according to the negative log likelihood (NLL) was selected and also evaluated on the mean absolute error (MAE) and the root mean squared error (RMSE). As can be seen in the table below, the LogNormal distribution obtained the best results not only in terms of the NLL, which was expected given that it is the distribution that best fits the target distribution, but also on the deterministic scores. This highlights the idea that calibrated probabilistic models can also be of interest for deterministic tasks.

| Metric | LogNormal | Normal | Laplace |
| --- | --- | --- | --- |
| NLL | 10.66 | 10.88 | 10.76 |
| MAE | 9474 | 9644 | 9509 |
| RMSE | 13735 | 13844 | 14043 |
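To make the three scores concrete, here is a minimal sketch of how they could be computed for a LogNormal predictive distribution. It assumes the NLL is averaged per observation and that the point prediction used for MAE and RMSE is the mean of the predicted distribution; neither is confirmed by the project. The salaries and parameters are made up.

```python
import math


def lognormal_nll(y, mu, sigma):
    """Negative log likelihood of one observation under LogNormal(mu, sigma)."""
    z = (math.log(y) - mu) / sigma
    return math.log(y) + math.log(sigma) + 0.5 * math.log(2 * math.pi) + 0.5 * z * z


def lognormal_mean(mu, sigma):
    """Mean of a LogNormal(mu, sigma), used here as the point prediction."""
    return math.exp(mu + 0.5 * sigma ** 2)


def evaluate(salaries, params):
    """Average NLL, MAE and RMSE over (salary, (mu, sigma)) pairs."""
    n = len(salaries)
    nll = sum(lognormal_nll(y, mu, s) for y, (mu, s) in zip(salaries, params)) / n
    errors = [y - lognormal_mean(mu, s) for y, (mu, s) in zip(salaries, params)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return {"NLL": nll, "MAE": mae, "RMSE": rmse}


# Hypothetical observed salaries and predicted (mu, sigma) per person.
salaries = [22000, 35000, 60000]
params = [(math.log(25000), 0.3), (math.log(33000), 0.3), (math.log(50000), 0.4)]

scores = evaluate(salaries, params)
print(scores)
```

The NLL rewards a distribution that places high density on the observed salary, while MAE and RMSE only look at the distance between the salary and a single predicted value.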

Model Explainability

To understand how the model learns to estimate the conditional distribution, we ran a SHAP analysis on the deployed model. The image below shows which features the model relies on most and how they impact the predictions. For example, working experience is the feature that most impacts model predictions, which is aligned with our expectations: the more years of experience, the higher the salary. There are other expected cases, such as higher salaries for those working in Lisbon, those with MSc degrees, and CTOs.