gabastil opened this issue 2 years ago
I don't think that returning 0 when there is a divide by zero error is the correct solution. https://en.wikipedia.org/wiki/Division_by_zero
Sure, it's not technically, mathematically the correct solution; that should be `inf`, right?
However, on a context-by-context basis, something needs to be returned that does not collapse the entire fit process. Numpy's solution is to default all `nan` values to zero (see `numpy.nan_to_num`). As this is an array of accuracies, when there are no `TN` counts, any predicted negatives would be "inaccurate", which is the reasoning for returning a zero.
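For concreteness, here is the numpy behavior I'm referring to, plus the same idea as a small helper; `safe_rate` is hypothetical and not part of numpy or aif360:

```python
import numpy as np

# numpy's default handling: nan -> 0.0 (and +/-inf -> large finite values)
print(np.nan_to_num(np.array([0.8, np.nan, 0.6])))  # -> [0.8 0.  0.6]

def safe_rate(numerator, denominator):
    """Hypothetical helper: return 0.0 instead of nan/inf when the denominator is 0."""
    return numerator / denominator if denominator else 0.0

# With no true negatives and no false positives, TNR = TN / (TN + FP) would be 0/0;
# the helper returns 0.0 so the fit loop keeps running.
print(safe_rate(0, 0))  # -> 0.0
```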
That thought process, in addition to the `nan` handling function in numpy, led me to that proposed workaround.
If there's a better solution, happy to have that implemented as well. Otherwise, I can submit a PR for what I mentioned.
As long as the `nan` issue arising from `fit()` does not derail the entire pipeline, I am happy!
It looks like this is what sklearn does (e.g., `recall_score`) and what we do in the sklearn-compatible metrics as well.
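For reference, this is the sklearn behavior (assuming a reasonably recent scikit-learn, where `recall_score` accepts `zero_division`):

```python
from sklearn.metrics import recall_score

y_true = [0, 0, 0, 0]  # no positive samples, so recall's denominator (TP + FN) is 0
y_pred = [0, 0, 0, 0]

# Default: returns 0.0 and emits an UndefinedMetricWarning.
print(recall_score(y_true, y_pred))
# The fallback value can also be set explicitly, which silences the warning.
print(recall_score(y_true, y_pred, zero_division=0))
```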
However, it seems like a really breaking issue if your dataset has no positive or no negative samples at all. What's the point of running a debiasing algorithm on such a dataset?
The dataset does have positive and negative samples, but I see where you are coming from.
Regardless, even if it just "fails gracefully" with a message describing the error (i.e., a division-by-zero error at the point of division), it'd be helpful to see that.
Even better, before even getting to the loop, if it's known that the metrics cannot be calculated properly, an `assert` statement with a helpful message (see the sketch below) might be better than letting both the `classification_threshold` and `ROC_margin` loops run through, knowing that the `nan` values they are generating will fail the function.
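Something along these lines is what I have in mind; the function, its arguments, and where it would be called from are all placeholders, not the actual aif360 API:

```python
import numpy as np

def check_both_classes_present(labels, favorable_label, unfavorable_label):
    """Hypothetical pre-check run before the threshold/ROC-margin loops:
    fail fast with a clear message instead of filling balanced_acc_arr with nan."""
    labels = np.asarray(labels).ravel()
    n_fav = int(np.sum(labels == favorable_label))
    n_unfav = int(np.sum(labels == unfavorable_label))
    assert n_fav > 0 and n_unfav > 0, (
        f"fit() needs both classes to compute TPR/TNR: found {n_fav} favorable "
        f"and {n_unfav} unfavorable labels, so the rate denominators would be zero."
    )

# e.g. check_both_classes_present(dataset_true.labels, 1.0, 0.0)
```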
In the case of a zero in the denominator of the rate functions (e.g., `true_negative_rate`), the fit function throws an error. In my case, the `classified_transf_metric.true_negative_rate()` function in lines 143 and 144 of `aif360/algorithms/postprocessing/reject_option_classification.py` returns `nan`, which causes the entire addition expression assigned to `balanced_acc_arr[cnt]` to evaluate to `nan`. This, in turn, causes downstream operations to throw errors.
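A minimal reproduction of the failure mode, assuming the usual definitions TPR = TP / (TP + FN) and TNR = TN / (TN + FP):

```python
import numpy as np

TP, FN = 7, 3   # some positives are classified
TN, FP = 0, 0   # but there are no negative outcomes at all

TPR = TP / (TP + FN)              # 0.7
TNR = np.float64(TN) / (TN + FP)  # 0.0 / 0 -> nan (RuntimeWarning: invalid value)

balanced_acc = 0.5 * (TPR + TNR)  # nan propagates through the addition
print(balanced_acc)               # -> nan, which then breaks the downstream operations
```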
I propose returning zero and emitting a warning message that this has occurred, but I'm not sure if that is the best route forward for this issue.
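A sketch of that workaround, as a standalone illustration rather than the actual change to the aif360 metric code:

```python
import warnings

def rate_or_zero(numerator, denominator, name="rate"):
    """Hypothetical wrapper for the proposal: return 0.0 on a zero denominator
    and warn, so fit() can continue but the user still sees what happened."""
    if denominator == 0:
        warnings.warn(
            f"{name} is undefined (zero denominator); returning 0.0 so fit() can continue.",
            RuntimeWarning,
        )
        return 0.0
    return numerator / denominator

tnr = rate_or_zero(0, 0, name="true_negative_rate")  # warns, returns 0.0
print(tnr)
```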