TSFelg / fairly

Fairly is a tool to help tech workers residing in Portugal know if they're being paid fairly.

Improve model transparency #5

Open TSFelg opened 3 years ago

TSFelg commented 3 years ago

Fairly is only useful if users feel they can trust the results. This is especially important in corner cases where there are few data points in the training distribution and the model may not have learned the specific context as well. Currently, what should happen in these cases is that the prediction bands widen to reflect the uncertainty in the training data, either due to high variance or due to a low number of data points.

Although the above already helps users understand how confident the model is in its prediction (wider bands -> more uncertainty), it would still be useful to make the model more transparent. Some possible approaches are listed below:

Histogram Show the histogram for the specific input context below the distribution. This is the most straightforward approach. It's interesting, but I have my doubts it would work due to the small amount of data for several cases. In a way, it actually opposes the fundamental idea of the modelling, which is to trust the generalization capabilities of the model. With a histogram, information from two input contexts that differ in only one feature can't be leveraged from one to the other. Having said that, it should be interesting to test it.
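To make the idea concrete, a minimal sketch of the context histogram: filter the survey data to the rows that exactly match the user's input context and look at their salaries. The column names and data here are hypothetical, not Fairly's actual schema.

```python
import pandas as pd

# Hypothetical survey data; columns and values are illustrative only.
df = pd.DataFrame({
    "role": ["Data Scientist", "Data Scientist", "Backend Dev", "Data Scientist"],
    "experience": ["3-5", "3-5", "3-5", "6-9"],
    "salary": [35000, 42000, 38000, 55000],
})

def context_salaries(df, role, experience):
    """Salaries observed for one exact input context (the histogram's data)."""
    match = df[(df["role"] == role) & (df["experience"] == experience)]
    return match["salary"]

salaries = context_salaries(df, "Data Scientist", "3-5")
print(len(salaries))  # number of exact matches backing the prediction -> 2
```

This also makes the limitation obvious: an exact-match filter discards every row that differs in a single feature, which is precisely the information the model is able to generalize from.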

Embedding The idea here would be to learn a 2D embedding of the input space and allow the user to explore it. When hovering over a point, it would show the corresponding input context, with the color encoding the salary. Below is a quick proof of concept (image attached).

Basically, this would be an unsupervised alternative to the current approach. Since it lets the user explore the raw input space, it would be more transparent and would let them quickly find the most similar users to themselves and the corresponding salaries. I believe it can be an interesting auxiliary approach, especially for corner cases.
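A rough sketch of the embedding step, assuming the input contexts are already one-hot encoded (the encoding and t-SNE are my assumptions; any 2D projection such as UMAP would work the same way):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Hypothetical one-hot encoded input contexts (role, experience, region, ...).
X = rng.integers(0, 2, size=(30, 8)).astype(float)
salary = rng.uniform(20000, 80000, size=30)  # colour channel in the real plot

# Learn a 2D embedding of the input space; each point is one respondent.
emb = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)
print(emb.shape)  # (30, 2): x/y coordinates to plot, coloured by salary
```

The `emb` coordinates plus the raw context and salary per point are all the frontend needs for the hover-and-color interaction.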

Calibration Unlike the previous two, this one simply means continuing the current probabilistic modelling approach while looking more carefully into the model's calibration and how it differs across input contexts.
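One simple calibration check along these lines (a sketch, not Fairly's code): measure the empirical coverage of the prediction bands, i.e. the fraction of held-out salaries that actually fall inside their predicted interval. For a well-calibrated 90% band this should be close to 0.90, and it can be computed per input context to see where calibration degrades.

```python
import numpy as np

def interval_coverage(y_true, lower, upper):
    """Fraction of observations falling inside their predicted band."""
    y_true, lower, upper = map(np.asarray, (y_true, lower, upper))
    return float(np.mean((y_true >= lower) & (y_true <= upper)))

# Toy example: 4 of 5 salaries fall inside their predicted bands.
y  = [30000, 45000, 52000, 61000, 80000]
lo = [25000, 40000, 50000, 55000, 80500]
hi = [35000, 50000, 60000, 70000, 90000]
print(interval_coverage(y, lo, hi))  # 0.8
```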

chrismolli commented 3 years ago

I like the embedding idea. You could further introduce a distance metric to the closest n data points within the embedded projection. One could also check whether the "new" data point falls inside or outside the area bounded by the border points, to see whether the model is interpolating or extrapolating at inference time. Check out "chart.js" for some nice interactive plotting ;)
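Both diagnostics can be sketched with scipy (a rough illustration under my own assumptions, using a KD-tree for the nearest-neighbour distances and a convex hull test for the interpolation/extrapolation check):

```python
import numpy as np
from scipy.spatial import Delaunay, cKDTree

rng = np.random.default_rng(1)
points = rng.uniform(0, 1, size=(50, 2))  # hypothetical embedded training points

tree = cKDTree(points)   # fast nearest-neighbour queries
hull = Delaunay(points)  # find_simplex returns -1 for points outside the hull

def diagnose(new_point, k=5):
    """Mean distance to the k nearest embedded neighbours, plus whether the
    point lies inside the convex hull (interpolating) or outside (extrapolating)."""
    dists, _ = tree.query(new_point, k=k)
    inside = bool(hull.find_simplex(new_point) >= 0)
    return float(np.mean(dists)), inside

print(diagnose([0.5, 0.5]))  # small distances, likely inside the hull
print(diagnose([5.0, 5.0]))  # large distances, outside -> extrapolation
```

A large neighbour distance or an outside-the-hull result would be a natural trigger for widening the bands or warning the user.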