[x] Given a NN with learned weights, remove the part that processes the sensitive attributes, so that the part working on the permissible attributes alone produces the output predictions (see the model sketch after this list)
[x] Check whether the predictions satisfy some fairness definitions (or to what extent they satisfy them; see the metrics sketch after this list)
[ ] Use your current version of the code to add fairness metrics to the results when working only on the permissible attributes (this can be our benchmark: what is the degree of fairness if we just ignore the sensitive attributes?)
[x] Modify your code so that it saves the weights the NN learned when all attributes are available and then loads them for the only_permissible model (also shown in the sketch below)
[ ] Modify your loss function by adding penalty terms (see the comment below)
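
A minimal sketch of the two-branch setup described in the first and fourth items, assuming PyTorch. The final logits are the sum of a branch fed only sensitive attributes and a branch fed only permissible attributes, so "removing" the sensitive part just means skipping its branch at inference time. All names here (`TwoBranchNet`, `n_sens`, `n_perm`, the file name) are hypothetical, not fixed by the project:

```python
import torch
import torch.nn as nn

class TwoBranchNet(nn.Module):
    def __init__(self, n_sens, n_perm, n_hidden, n_out):
        super().__init__()
        # Branch fed only the sensitive attributes
        self.sensitive = nn.Sequential(
            nn.Linear(n_sens, n_hidden), nn.ReLU(), nn.Linear(n_hidden, n_out)
        )
        # Branch fed only the permissible attributes
        self.permissible = nn.Sequential(
            nn.Linear(n_perm, n_hidden), nn.ReLU(), nn.Linear(n_hidden, n_out)
        )

    def forward(self, x_sens, x_perm, use_sensitive=True):
        out_perm = self.permissible(x_perm)
        if not use_sensitive:
            # "Remove" the sensitive part: predict from permissible attributes only
            return out_perm
        out_sens = self.sensitive(x_sens)
        return out_sens + out_perm

# Train with all attributes and save the learned weights ...
model = TwoBranchNet(n_sens=3, n_perm=10, n_hidden=32, n_out=2)
torch.save(model.state_dict(), "all_attributes.pt")

# ... then reload them for the only_permissible model and skip the sensitive branch
model.load_state_dict(torch.load("all_attributes.pt"))
# logits = model(x_sens, x_perm, use_sensitive=False)
```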
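For the fairness checks in the second and third items, a sketch of two common metrics (demographic parity difference and equalized-odds gap), assuming binary 0/1 predictions and a binary sensitive-group indicator as NumPy arrays; the function names are hypothetical and the project may use different fairness definitions:

```python
import numpy as np

def demographic_parity_diff(y_pred, group):
    """|P(yhat=1 | group=0) - P(yhat=1 | group=1)|; 0 means demographic parity."""
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def equalized_odds_gap(y_pred, y_true, group):
    """Largest gap in TPR/FPR between the groups; 0 means equalized odds."""
    gaps = []
    for y in (0, 1):  # y=1 gives the TPR gap, y=0 the FPR gap
        mask = y_true == y
        r0 = y_pred[mask & (group == 0)].mean()
        r1 = y_pred[mask & (group == 1)].mean()
        gaps.append(abs(r0 - r1))
    return max(gaps)
```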
The loss in this case is the sum of the accuracy term (the usual prediction loss) and the penalties. I suggested penalty1 to be the norm of output_sensitive. Penalty2 can be the similarity between output_sensitive and output_permissible, calculated using cosine similarity: https://en.wikipedia.org/wiki/Cosine_similarity
You can consider combinations of these two penalties, and you can suggest other penalties to distinguish the impacts of the sensitive and permissible attributes. The performance of the method with different penalties is evaluated using the fairness metrics (and we can also track the accuracy). A sketch of the combined loss follows.
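
A minimal sketch of that combined loss, again assuming PyTorch: cross-entropy as the accuracy term, penalty1 as the norm of output_sensitive, and penalty2 as the cosine similarity between output_sensitive and output_permissible. The weights `lambda1` and `lambda2` are hypothetical knobs for combining the penalties, not values fixed by the project:

```python
import torch
import torch.nn.functional as F

def fairness_penalized_loss(out_sens, out_perm, target, lambda1=0.1, lambda2=0.1):
    # Accuracy term: the usual prediction loss on the combined logits
    task_loss = F.cross_entropy(out_sens + out_perm, target)
    # penalty1: shrink the sensitive branch's output toward zero
    penalty1 = out_sens.norm(dim=1).mean()
    # penalty2: penalize alignment between the two branches' outputs
    penalty2 = F.cosine_similarity(out_sens, out_perm, dim=1).abs().mean()
    return task_loss + lambda1 * penalty1 + lambda2 * penalty2
```

Sweeping `lambda1` and `lambda2` (including setting either to zero) gives the combinations of penalties mentioned above, each evaluated with the fairness metrics while tracking accuracy.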