Closed NicolasMontes closed 4 years ago
It depends. For the OCSVM (a P-classifier) this is true: the unlabeled data are used only to calculate the PU performance metrics, and through those metrics for model/parameter selection, but not to train the classifier. For BSVM or Maxent (PU-classifiers), the unlabeled data are also used to train the classifier, not only for model/parameter selection. Does this help? Please close the issue if this answers your question.
It has been very helpful, thank you very much! Excellent library, but I have some more questions:
I have a dataset with 45930 positive samples and 20477 unlabeled samples. I used an OCSVM to find the parameters with the following code:
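library(caret)     # provides createFolds
library(oneClass)  # provides trainOcc
tr_index <- createFolds(tr_y, k = 2, returnTrain = TRUE)  # training indices for 2-fold CV
ocsvm.fit <- trainOcc(x = tr_x, y = tr_y, method = "ocsvm", index = tr_index)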
The final model is the one with the highest puF. Is the default threshold of 0 used to make that decision?
Another question, about the diagnostic plot: the dark blue boxplot shows the observations that were left out when the model was trained. Which fold do they correspond to? The light blue boxplot shows the observations that were classified as positive in the unlabeled validation data set, and the gray boxplot shows all predictions for the whole data set, right?
Yes, in this case each parameter combination is only evaluated at the default threshold of 0.
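To make the threshold-at-0 evaluation concrete, here is a minimal base R sketch. It assumes that puF is the squared true-positive rate on the positive hold-out samples divided by the share of unlabeled samples predicted positive, and that a decision value of 0 or above counts as a positive prediction. The function `pu_f_at_threshold` and the toy decision values are hypothetical and not part of the oneClass API; they only illustrate the idea.

```r
# Hypothetical helper (not from oneClass): puF at a fixed threshold.
# pred_pos: decision values of held-out positive samples
# pred_un:  decision values of unlabeled samples
pu_f_at_threshold <- function(pred_pos, pred_un, threshold = 0) {
  tpr <- mean(pred_pos >= threshold)  # share of positives predicted positive
  ppp <- mean(pred_un >= threshold)   # share of unlabeled predicted positive
  if (ppp == 0) return(NA_real_)      # ratio is undefined in this case
  tpr^2 / ppp
}

# Toy decision values, invented for illustration only
pred_pos <- c(0.8, 0.3, -0.1, 0.5)
pred_un  <- c(-0.9, -0.4, 0.2, -0.7, 0.6, -0.2, -1.1, -0.3)

pu_f <- pu_f_at_threshold(pred_pos, pred_un)  # uses threshold = 0
print(pu_f)
```

Under these assumptions, model selection would compute this value once per parameter combination and keep the combination with the highest result.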
The dark blue boxplot contains all hold-out samples. In a 5-fold CV, each fold is held out once. The hold-out samples from all folds are collected to build the dark blue boxplot.
The light blue boxplot shows the same samples, but predicted by the final model, which is fitted on all of them. The grey boxplot shows all the unlabeled samples (also held out if you use BSVM with CV). With the OCSVM they are always held out anyway and are used only for metric computation.
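The pooling of hold-out predictions can be sketched in base R as follows. This is only an illustration: the "model" is a stand-in (negative distance to the mean of the training samples), not the OCSVM, and all variable names are made up for the example.

```r
set.seed(1)
n <- 10
k <- 5
x <- rnorm(n)                                       # toy positive samples
fold_id <- sample(rep(seq_len(k), length.out = n))  # each sample belongs to one fold

# Hold-out predictions: every sample is scored by a model that did not see it
holdout_pred <- rep(NA_real_, n)
for (f in seq_len(k)) {
  test  <- which(fold_id == f)
  train <- which(fold_id != f)
  center <- mean(x[train])                     # stand-in for fitting on the training folds
  holdout_pred[test] <- -abs(x[test] - center) # stand-in decision value
}

# Final-model predictions: the same samples, scored by a model fitted on all of them
final_pred <- -abs(x - mean(x))

# The two vectors correspond to the two positive boxplots, e.g.:
# boxplot(list(holdout = holdout_pred, final = final_pred))
```

After the loop, `holdout_pred` holds exactly one prediction per sample, so no single fold is singled out in the pooled distribution.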
Thank you!
Hello, I have a question about the first example in "One-class classification in R with the oneClass package": do you train a model with 20 positive samples and use the 500 unlabeled samples to calculate the performance, and then select the best model? Thank you.