donlnz / nonconformist

Python implementation of the conformal prediction framework.
MIT License
436 stars 96 forks

Negative p-values for TCP #10

Closed gchers closed 6 years ago

gchers commented 6 years ago

Hi Henrik. Thanks (again) for the library!

I noticed negative p-values when using smoothed TCP. I am currently not able to provide a compact reproducible example of this, but I found the following errors in the implementation.

TCPClassifier calls ICPClassifier.predict() to compute p-values, but:

1) If self.smoothing=True, self.base_icp is also instantiated with self.smoothing. This is bad, because ICP multiplies n_eq by a random number in [0,1], but then TCP applies a second random number in [0,1]; this clearly does not produce the desired result. To fix this particular issue, I would suggest giving ICPClassifier.predict() a return_stats boolean option which, if True, returns n_eq and n_gt instead of the p-values, so that the caller can compute the p-values as it pleases. However, this is probably not the best solution (see below).

2) ICPClassifier.predict() computes p-values for all classes, which are then immediately discarded by TCPClassifier.predict(). https://github.com/donlnz/nonconformist/blob/9a7004e098d81b54b75e3c4808d704abafa4bd49/nonconformist/cp.py#L137 This is clearly wasteful computationally, although I don't think it causes any misbehaviour in the algorithm.

3) It seems to me that TCPClassifier.predict() appends x[i,:] to train_x, but then also makes a prediction including x[i,:], effectively obtaining two nonconformity scores for that object. The actual algorithm should compute a nonconformity score for z_i as A(Z \ z_i, z_i) (i.e., excluding z_i from the training set). This matters especially for small sample sizes.
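For reference, points (1) and (3) combined would give a TCP prediction along these lines. This is only a sketch: `nonconformity` below is a hypothetical stand-in for whatever nonconformity measure is configured, and `tcp_p_value` is not part of the library's API.

```python
import numpy as np

def nonconformity(train_x, train_y, x, y):
    # Hypothetical nonconformity measure A(Z, z): distance of x to the
    # mean of the training objects sharing label y (illustration only).
    center = train_x[train_y == y].mean(axis=0)
    return np.linalg.norm(x - center)

def tcp_p_value(train_x, train_y, x, y, smoothing=True, rng=None):
    """Transductive p-value for test object x with postulated label y.

    Each provisional example z_i gets its score as A(Z \\ z_i, z_i),
    i.e. with z_i excluded from the training set (leave-one-out).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Provisionally extend the training set with (x, y).
    ext_x = np.vstack([train_x, x])
    ext_y = np.append(train_y, y)
    n = len(ext_y)
    scores = np.empty(n)
    for i in range(n):
        loo = np.arange(n) != i  # leave z_i out
        scores[i] = nonconformity(ext_x[loo], ext_y[loo], ext_x[i], ext_y[i])
    test_score = scores[-1]
    n_gt = np.sum(scores > test_score)
    n_eq = np.sum(scores == test_score)  # includes the test object itself
    if smoothing:
        # Apply the random tie-breaking exactly once.
        return (n_gt + rng.random() * n_eq) / n
    return (n_gt + n_eq) / n
```

Since `n_eq >= 1` (the test object always ties with itself) and the tie-breaking factor is applied in a single place, the resulting p-value always lands in [0, 1].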

IMHO, I think it'd be better to just implement TCP on its own, rather than calling ICPClassifier.calibrate() and ICPClassifier.predict(). I'll send you a PR with a fix suggestion.

donlnz commented 6 years ago

Hey Giovanni, thanks for your input (and the patch)!

I agree that your implementation makes a lot more sense—mine was simply the laziest alternative I could think of at the time... :)

Regarding your third point, you are right that the current solution does not compute nonconformity scores in a leave-one-out fashion. It is not obvious to me which would be the most reasonable default behavior: on the one hand, computing scores using leave-one-out is strictly better in terms of informational efficiency (though this does depend on the underlying model; an unstable model, e.g., a decision tree, will benefit more than something like a support vector machine), but on the other hand the computational complexity suffers greatly, of course. I suppose the current implementation is mostly suited to a very particular case: when the data set is too small for an ICP but too large for a leave-one-out TCP, which should already be handled reasonably well by a well-configured CCP/BCP/ACP...

So, yes, it probably makes more sense to compute the TCP nonconformity scores using leave-one-out by default; although, it might also be reasonable to be able to optionally select the current behavior through a constructor parameter. If you'd like to submit another patch addressing this, I'd be happy to take a look, otherwise I'll look into it as soon as I have the time.

As an aside, your suggestion regarding a return_stats type of behavior might still be beneficial, even if it is not used for TCP; the current implementation of CCP and BCP is slightly wonky (in the same way as the current TCP implementation, i.e., with regard to the smoothing parameter). It would probably make sense to consolidate the p-value computations into a single implementation (another thing I'll have to look into)—after all, as long as we know n_gt and n_eq they're all effectively the same. A single implementation would also make extensions, e.g. interpolation, simpler to implement.
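A consolidated helper of that kind might look as follows. This is a sketch only: the name `p_values` and its signature are illustrative, not the library's API; the point is simply that once n_gt and n_eq are known, smoothed and unsmoothed p-values come from one formula.

```python
import numpy as np

def p_values(n_gt, n_eq, n, smoothing=False, rng=None):
    """Compute conformal p-values from tie statistics.

    n_gt : counts of scores strictly greater than the test score
    n_eq : counts of scores equal to the test score (test object included)
    n    : total number of scores
    """
    n_gt = np.asarray(n_gt, dtype=float)
    n_eq = np.asarray(n_eq, dtype=float)
    if smoothing:
        rng = np.random.default_rng() if rng is None else rng
        # Break ties randomly, exactly once, in this single place.
        return (n_gt + rng.random(n_gt.shape) * n_eq) / n
    return (n_gt + n_eq) / n
```

With ICP, TCP, CCP, and BCP all routing through one such function, the smoothing parameter is applied consistently, and an extension such as interpolation only needs to be implemented here.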