autodeployai / pmml4s

PMML scoring library for Scala
https://www.pmml4s.org/
Apache License 2.0

Rounding error when outputting "probability" fields #13

Closed abarbet-zz closed 3 years ago

abarbet-zz commented 3 years ago

I trained an SVM using sklearn with the probability option set to True. This instructs the classifier to add Platt scaling accessible through the classifier's predict_proba method to assign a probability to each of the possible classes. When I serialize this model into PMML, I'm getting OutputFields that match what I'd expect to see, i.e.,

<Output>
  <OutputField name="probability_0" optype="continuous" dataType="double" feature="probability" value="0"/>
  <OutputField name="probability_1" optype="continuous" dataType="double" feature="probability" value="1"/>
  <OutputField name="probability_2" optype="continuous" dataType="double" feature="probability" value="2"/>
  <OutputField name="predicted_target" optype="categorical" dataType="integer" feature="predictedValue"/>
</Output>

However, whenever I call the PMML4S predict method on this model, the first three probability output fields appear to be rounded: each probability is always some multiple of 1/3. For example, if the predicted class is 0, the output looks like [0.6666666666666666, 0.0, 0.3333333333333333, 0]. The same holds for any predicted class.

I've checked all the variables in my PMML file, and I know that the input to the PMML is correct (i.e. the rounding errors aren't occurring there). Do you know what could be causing this?

scorebot commented 3 years ago

@abarbet This is a long-known question; see the comments below.

scorebot commented 3 years ago

@abarbet Some background on SVM probabilities: an SVM does not directly provide a probability. By default, the SVM model gives you a voting-based "pseudo" probability distribution. I know sklearn adds Platt scaling, which internally uses cross-validation to obtain a probability, but the exported PMML contains no information about that calibration, so each probability is just a multiple of 1/3.
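To illustrate why vote shares land on multiples of 1/3, here is a minimal Python sketch. It assumes a one-vs-one scheme, where three classes give three pairwise classifiers and each casts one vote. The function name `vote_probabilities` and the `winners` table are hypothetical and made up for illustration; this is not PMML4S's actual code.

```python
from itertools import combinations


def vote_probabilities(classes, pairwise_winner):
    """Convert one-vs-one votes into a vote-share distribution.

    `pairwise_winner(a, b)` is a hypothetical callback that returns
    whichever of the two classes the (a, b) classifier picks.
    """
    pairs = list(combinations(classes, 2))
    votes = {c: 0 for c in classes}
    for a, b in pairs:
        votes[pairwise_winner(a, b)] += 1
    # Each share is votes / number_of_pairs, so with 3 pairs it is k/3.
    return [votes[c] / len(pairs) for c in classes]


# Assumed pairwise outcomes, chosen to reproduce the output in the report:
# class 0 beats 1, class 0 beats 2, class 2 beats 1.
winners = {(0, 1): 0, (0, 2): 0, (1, 2): 2}
probs = vote_probabilities([0, 1, 2], lambda a, b: winners[(a, b)])
print(probs)  # [0.6666666666666666, 0.0, 0.3333333333333333]
```

Under these assumptions the values are exact vote fractions rather than rounded probabilities, which is why they do not match what predict_proba returns after Platt scaling.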

scorebot commented 3 years ago

I'm closing this issue now. If you have other problems, please feel free to open a new one.