dssg / triage

General Purpose Risk Modeling and Prediction Toolkit for Policy and Social Good Problems

fix sorting of dataframe for aequitas calculations during evaluations #858

Closed shaycrk closed 3 years ago

shaycrk commented 3 years ago

Currently, the attributes for bias analysis via aequitas are getting scrambled relative to the scores and labels when the latter get sorted for "best case" and "worst case" analyses. To fix this issue, this PR sorts the index (e.g., the entity_id, as_of_date tuple) as well, then applies this re-sorted index to the dataframe with the attributes used for aequitas calculations.
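To illustrate the idea, here is a minimal pandas sketch, not the actual code in this PR. The function name `sort_with_attributes` and its parameters are hypothetical, and it assumes scores, labels, and `protected_df` all share a unique `(entity_id, as_of_date)` index. Scores are sorted in descending order, ties are broken by label (positives first for "best case", negatives first for "worst case"), and the attribute dataframe is then reindexed with the re-sorted index so its rows stay lined up.

```python
import pandas as pd


def sort_with_attributes(scores, labels, protected_df, best_case):
    """Hypothetical helper: sort scores/labels, then reorder protected_df to match.

    Assumes all three inputs share the same unique index.
    """
    df = pd.DataFrame({"score": scores, "label": labels})
    # Scores descending; ties broken by label:
    # descending labels for "best case", ascending for "worst case".
    df = df.sort_values(["score", "label"], ascending=[False, not best_case])
    # Apply the re-sorted index to the attributes so rows stay aligned.
    aligned_protected = protected_df.reindex(df.index)
    return df["score"], df["label"], aligned_protected


index = pd.MultiIndex.from_tuples(
    [(1, "2020-01-01"), (2, "2020-01-01"), (3, "2020-01-01")],
    names=["entity_id", "as_of_date"],
)
scores = pd.Series([0.5, 0.9, 0.5], index=index)
labels = pd.Series([0, 1, 1], index=index)
protected_df = pd.DataFrame({"race": ["a", "b", "c"]}, index=index)

best = sort_with_attributes(scores, labels, protected_df, best_case=True)
worst = sort_with_attributes(scores, labels, protected_df, best_case=False)
```

In both cases the returned attribute dataframe carries the same index order as the sorted scores and labels, which is the alignment the aequitas calculations depend on.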

Also note that I added a check that raises a ValueError if there is a mismatch between the indices for the protected_df and labels (they must have the same shape and the same set of unique values). This enforces that protected_df covers the full set of entities and has no duplicates, which I don't think is an unreasonable requirement for running the bias analysis, but if anyone disagrees, we can remove the check.
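For reference, the check could look roughly like the sketch below. The function name `check_protected_index` and the error message are hypothetical, not necessarily what the PR uses, and the "no duplicates" guarantee assumes the labels index is itself unique.

```python
import pandas as pd


def check_protected_index(protected_df, labels):
    """Hypothetical check: raise ValueError unless both indices match."""
    same_shape = protected_df.index.shape == labels.index.shape
    same_values = set(protected_df.index) == set(labels.index)
    if not (same_shape and same_values):
        raise ValueError(
            "protected_df and labels must have the same index shape "
            "and the same set of unique index values"
        )


idx = pd.MultiIndex.from_tuples(
    [(1, "2020-01-01"), (2, "2020-01-01")],
    names=["entity_id", "as_of_date"],
)
labels = pd.Series([0, 1], index=idx)

# Same entities in a different order: passes, since order is fixed later by sorting.
complete = pd.DataFrame({"race": ["b", "a"]}, index=idx[::-1])
check_protected_index(complete, labels)
```

A `protected_df` that is missing an entity, or that repeats one, fails the check: the first changes the set of index values and both change the shape.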