autodeployai / pmml4s

PMML scoring library for Scala
https://www.pmml4s.org/
Apache License 2.0
62 stars 10 forks

Inconsistent value between predicted value and probabilities #24

Closed tmbrye closed 2 years ago

tmbrye commented 2 years ago

When scoring a model I am seeing an inconsistency in the predicted results. In several cases in the data, the prediction column states a predicted class of 1, yet the probability of class 0 is higher than the probability of class 1. The Python code I used to test is as follows:

import pandas as pd
import pypmml
from pypmml import Model

print(pypmml.__version__)
#lst = [[1,0,0,0,0,0,0,1,1,0,1,0,80,120,0,53,1,0]]
lst = [
[1,0,0,0,0,0,0,1,1,0,1,0,80,120,0,53,1,0],
[1,0,0,0,0,1,0,1,1,0,1,0,80,120,0,40,0,0],
[1,0,1,0,0,0,0,1,0,0,1,0,90,160,0,64,1,0],
[1,0,0,0,0,0,0,1,1,1,1,0,80,120,0,57,0,0],
[1,0,0,0,0,0,0,0,1,1,1,0,80,120,0,52,0,0],
[0,1,0,1,0,0,0,1,1,0,0,0,90,160,0,62,1,0],
[0,1,1,0,1,0,0,1,0,1,1,0,74,110,0,46,0,0],
[1,0,0,0,0,0,0,1,1,1,1,0,60,100,0,39,0,0],
[1,0,1,0,0,0,0,1,0,0,1,0,80,120,0,63,1,0],
[0,1,0,1,0,1,0,1,1,0,0,0,120,180,0,54,0,0],
[1,0,0,0,0,1,1,1,1,0,0,0,95,149,1,44,0,0]]

df = pd.DataFrame(lst, columns=['gluc1','gluc2','gender_female','cholesterol2','smoke','over_weight','cholesterol3','active','gender_male','normal_weight','cholesterol1','under_weight','ap_lo','ap_hi','alco','age','obesity','gluc3'])
m = Model.load('random_forest.pmml')
res = m.predict(df)
print(res)

The model is a rather large model. Here is the output of the above code executed:

0.9.16
   prediction   proba_0   proba_1
0           1  0.648421  0.351579
1           0  0.752614  0.247386
2           1  0.192443  0.807557
3           1  0.635218  0.364782
4           0  0.719272  0.280728
5           1  0.190509  0.809491
6           0  0.792876  0.207124
7           0  0.840534  0.159466
8           1  0.517247  0.482753
9           1  0.167679  0.832321
10          1  0.168039  0.831961

Note that rows 0, 3, and 8 are incorrect.
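The mismatched rows can be found mechanically. This sketch (using only pandas, with the probabilities hard-coded from the output above) flags every row where the predicted class disagrees with the argmax of the two probabilities:

```python
import pandas as pd

# Scoring output reproduced from the report above (prediction, proba_0, proba_1).
res = pd.DataFrame(
    [(1, 0.648421, 0.351579), (0, 0.752614, 0.247386), (1, 0.192443, 0.807557),
     (1, 0.635218, 0.364782), (0, 0.719272, 0.280728), (1, 0.190509, 0.809491),
     (0, 0.792876, 0.207124), (0, 0.840534, 0.159466), (1, 0.517247, 0.482753),
     (1, 0.167679, 0.832321), (1, 0.168039, 0.831961)],
    columns=["prediction", "proba_0", "proba_1"],
)

# Class with the highest probability, assuming an argmax decision rule.
argmax_class = (res["proba_1"] > res["proba_0"]).astype(int)

# Rows where the model's prediction disagrees with the argmax.
mismatches = res.index[res["prediction"] != argmax_class].tolist()
print(mismatches)  # [0, 3, 8]
```

This confirms that rows 0, 3, and 8 are exactly the ones where the predicted class is not the most probable class.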

random_forest.pmml.zip

scorebot commented 2 years ago

@tmbrye The results are indeed correct. If you open the attached model, you can see the definitions of the output fields:

          <Output>
            <OutputField dataType="string" feature="decision" name="prediction" optype="categorical" value="">
              <Apply function="if">
                <Apply function="greaterThan">
                  <FieldRef field="prediction_1"/>
                  <Constant dataType="double">0.35</Constant>
                </Apply>
                <Constant dataType="string">1</Constant>
                <Constant dataType="string">0</Constant>
              </Apply>
            </OutputField>
            <OutputField dataType="double" feature="probability" name="proba_0" optype="continuous" value="0"/>
            <OutputField dataType="double" feature="probability" name="proba_1" optype="continuous" value="1"/>
          </Output>

The prediction threshold is 0.35: if the probability of class 1 is greater than 0.35, the final prediction is 1. So the predictions for records 0, 3, and 8 are correct.

tmbrye commented 2 years ago

Thanks so much for your quick response! This model was passed along to me and I definitely didn't dive in as far as I should have to notice that. Appreciate your time.