AlineTalhouk / splendid

Supervised Learning Ensemble for Diagnostic Identification
https://alinetalhouk.github.io/splendid/

predictions #21

Closed AlineTalhouk closed 7 years ago

AlineTalhouk commented 7 years ago

@dchiu911 Do you know how to go from probability to class prediction? Is it the highest probability? Is there a way to set a threshold?

dchiu911 commented 7 years ago

Yes, that is how class predictions are made: the class with the highest probability is chosen. What do you mean by setting a threshold?
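
A minimal sketch of that rule (not splendid's actual code; the probability matrix and class names are made up for illustration):

```r
# Toy probability matrix: one row per case, one column per class
prob <- matrix(
  c(0.7, 0.2, 0.1,
    0.3, 0.4, 0.3,
    0.1, 0.1, 0.8),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("class1", "class2", "class3"))
)

# Predicted class = column with the highest probability in each row
pred <- colnames(prob)[max.col(prob, ties.method = "first")]
pred
#> [1] "class1" "class2" "class3"
```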

AlineTalhouk commented 7 years ago

I mean that if the probability is not greater than 0.5 for any class, we assign an "unclassified" label.
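
One way that could look (a sketch reusing the toy prob and pred objects from the earlier example; the 0.5 cutoff is the one proposed above):

```r
threshold <- 0.5

# Highest probability per case
top_prob <- apply(prob, 1, max)

# Cases whose top probability does not exceed the threshold get the
# "unclassified" label instead of their top class
pred_thr <- ifelse(top_prob > threshold, pred, "unclassified")
pred_thr
#> [1] "class1"       "unclassified" "class3"
```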

dchiu911 commented 7 years ago

What happens to the predicted class then?

dchiu911 commented 7 years ago

Sorry, I meant: why are we doing this?

AlineTalhouk commented 7 years ago

What this is supposed to do is classify only the cases for which you are sure about the classification. You can change the threshold to match the amount of sensitivity/specificity needed for the clinical application.

dchiu911 commented 7 years ago

How does this impact evaluation metrics then? Do we only compare the classified predictions with the corresponding true class labels?

AlineTalhouk commented 7 years ago

Yes. Isn't that what we currently do?

AlineTalhouk commented 7 years ago

I mean the probabilities don't change, only the class predictions.

dchiu911 commented 7 years ago

Well, currently all predictions are classified. For example, looking at one run of xgboost, if we set a threshold of 0.5, the number of classified predictions decreases to 40% of the original test sample.

Yes, probabilities don't change, but class predictions do, and they form the confusion matrices from which evaluation metrics are calculated.

Say, for example, instead of having c(4, 4, 3, 1, 2, 3), after thresholding I might have c(4, NA, NA, 1, 2, 3), in which case I would remove the corresponding indices (2nd and 3rd) in the true class labels before constructing the confusion matrix.
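
In code, that filtering might look like this (a sketch; the true class labels here are invented for illustration):

```r
pred  <- c(4, NA, NA, 1, 2, 3)  # NA marks the unclassified cases
truth <- c(4, 3, 3, 1, 2, 2)    # hypothetical true class labels

# Keep only the classified cases before building the confusion matrix
keep <- !is.na(pred)
table(Predicted = pred[keep], Truth = truth[keep])
```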

AlineTalhouk commented 7 years ago

Yes, but you just ignore those cases (you filter them out), so in your example you would exclude the 2nd and 3rd elements that correspond to NA.

dchiu911 commented 7 years ago

Yeah, that's what I was asking here:

"Do we only compare the classified predictions with the corresponding true class labels?"

I just want to touch base on all propagating side effects before modifying the code so we're on the same page.

AlineTalhouk commented 7 years ago

Yes. You would have to create a label to filter on, such as "unclassifiable" rather than NA, so it isn't confused with an error. We should be able to dial this in two ways: a default mode, which classifies all cases, and a threshold mode, which applies stricter classification rules. The threshold could be optimized with ROC curves, say.
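
For illustration only (not splendid's API), one simple way to explore that trade-off would be to sweep candidate thresholds and record the proportion of cases classified and the accuracy among the classified cases; sweep_thresholds and its defaults are hypothetical:

```r
sweep_thresholds <- function(prob, truth, thresholds = seq(0.3, 0.9, by = 0.1)) {
  pred <- colnames(prob)[max.col(prob, ties.method = "first")]
  top  <- apply(prob, 1, max)
  # One row per candidate threshold: how many cases remain classified,
  # and how accurate the classifier is on those cases
  t(sapply(thresholds, function(th) {
    keep <- top > th
    c(threshold  = th,
      classified = mean(keep),
      accuracy   = if (any(keep)) mean(pred[keep] == truth[keep]) else NA)
  }))
}
```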

dchiu911 commented 7 years ago

Yes, I was planning to use "unclassifiable" or some other string rather than NA; that was just an example. I will keep the complete predictions, add an attribute (maybe called class_threshold) that holds the same predictions but with some unclassified cases, and also an attribute for the proportion of classified cases (maybe called proportion).
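
A sketch of what that return value could look like, using the labels and attribute names proposed in this thread (the prediction values are made up):

```r
pred     <- c("class1", "class2", "class3", "class1")
pred_thr <- c("class1", "unclassifiable", "class3", "class1")

# Complete predictions stay intact; thresholded results ride along as attributes
attr(pred, "class_threshold") <- pred_thr
attr(pred, "proportion")      <- mean(pred_thr != "unclassifiable")
attr(pred, "proportion")
#> [1] 0.75
```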

dchiu911 commented 7 years ago

Implementation changes: