autodeployai / pmml4s

PMML scoring library for Scala
https://www.pmml4s.org/
Apache License 2.0

Unable to score data with nulls in 0.9.10 which was working in 0.9.9 #15

Closed soumyava closed 3 years ago

soumyava commented 3 years ago

Thanks for the last fix for the critical thread-safety issue. After upgrading to 0.9.10, we are facing an issue when scoring datasets with null values through Java. I see the same behavior with pypmml 0.9.10.

Here is my model (model.pmml):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML xmlns="http://www.dmg.org/PMML-4_4" xmlns:data="http://jpmml.org/jpmml-model/InlineTable" version="4.4">
    <Header>
        <Application name="JPMML-SkLearn" version="1.6.12"/>
        <Timestamp>2021-05-10T18:48:04Z</Timestamp>
    </Header>
    <MiningBuildTask>
        <Extension>PMMLPipeline(steps=[('mapping', DataFrameMapper(drop_cols=[],
                features=[(['age'], StandardScaler()),
                          (['workclass'], LabelEncoder()),
                          (['marital_status'], LabelEncoder()),
                          (['relationship'], LabelEncoder()),
                          (['race'], LabelEncoder()),
                          (['occupation'], LabelEncoder()),
                          (['native_country'], LabelEncoder())])),
       ('impute', SimpleImputer(strategy='most_frequent')),
       ('clf', LogisticRegression(max_iter=1000, multi_class='ovr', random_state=0,
                   solver='liblinear'))])</Extension>
    </MiningBuildTask>
    <DataDictionary>
        <DataField name="y" optype="categorical" dataType="string">
            <Value value="&lt;=50K"/>
            <Value value="&gt;50K"/>
        </DataField>
        <DataField name="age" optype="continuous" dataType="double"/>
        <DataField name="workclass" optype="categorical" dataType="string">
            <Value value="?"/>
            <Value value="Federal-gov"/>
            <Value value="Local-gov"/>
            <Value value="Never-worked"/>
            <Value value="Private"/>
            <Value value="Self-emp-inc"/>
            <Value value="Self-emp-not-inc"/>
            <Value value="State-gov"/>
            <Value value="Without-pay"/>
        </DataField>
        <DataField name="marital_status" optype="categorical" dataType="string">
            <Value value="Divorced"/>
            <Value value="Married-AF-spouse"/>
            <Value value="Married-civ-spouse"/>
            <Value value="Married-spouse-absent"/>
            <Value value="Never-married"/>
            <Value value="Separated"/>
            <Value value="Widowed"/>
        </DataField>
        <DataField name="relationship" optype="categorical" dataType="string">
            <Value value="Husband"/>
            <Value value="Not-in-family"/>
            <Value value="Other-relative"/>
            <Value value="Own-child"/>
            <Value value="Unmarried"/>
            <Value value="Wife"/>
        </DataField>
        <DataField name="race" optype="categorical" dataType="string">
            <Value value="Amer-Indian-Eskimo"/>
            <Value value="Asian-Pac-Islander"/>
            <Value value="Black"/>
            <Value value="Other"/>
            <Value value="White"/>
        </DataField>
        <DataField name="occupation" optype="categorical" dataType="string">
            <Value value="?"/>
            <Value value="Adm-clerical"/>
            <Value value="Armed-Forces"/>
            <Value value="Craft-repair"/>
            <Value value="Exec-managerial"/>
            <Value value="Farming-fishing"/>
            <Value value="Handlers-cleaners"/>
            <Value value="Machine-op-inspct"/>
            <Value value="Other-service"/>
            <Value value="Priv-house-serv"/>
            <Value value="Prof-specialty"/>
            <Value value="Protective-serv"/>
            <Value value="Sales"/>
            <Value value="Tech-support"/>
            <Value value="Transport-moving"/>
        </DataField>
        <DataField name="native_country" optype="categorical" dataType="string">
            <Value value="?"/>
            <Value value="Cambodia"/>
            <Value value="Canada"/>
            <Value value="China"/>
            <Value value="Columbia"/>
            <Value value="Cuba"/>
            <Value value="Dominican-Republic"/>
            <Value value="Ecuador"/>
            <Value value="El-Salvador"/>
            <Value value="England"/>
            <Value value="France"/>
            <Value value="Germany"/>
            <Value value="Greece"/>
            <Value value="Guatemala"/>
            <Value value="Haiti"/>
            <Value value="Honduras"/>
            <Value value="Hong"/>
            <Value value="Hungary"/>
            <Value value="India"/>
            <Value value="Iran"/>
            <Value value="Ireland"/>
            <Value value="Italy"/>
            <Value value="Jamaica"/>
            <Value value="Japan"/>
            <Value value="Laos"/>
            <Value value="Mexico"/>
            <Value value="Nicaragua"/>
            <Value value="Outlying-US(Guam-USVI-etc)"/>
            <Value value="Peru"/>
            <Value value="Philippines"/>
            <Value value="Poland"/>
            <Value value="Portugal"/>
            <Value value="Puerto-Rico"/>
            <Value value="Scotland"/>
            <Value value="South"/>
            <Value value="Taiwan"/>
            <Value value="Thailand"/>
            <Value value="Trinadad&amp;Tobago"/>
            <Value value="United-States"/>
            <Value value="Vietnam"/>
            <Value value="Yugoslavia"/>
        </DataField>
    </DataDictionary>
    <TransformationDictionary/>
    <RegressionModel functionName="classification" normalizationMethod="logit">
        <MiningSchema>
            <MiningField name="y" usageType="target"/>
            <MiningField name="age"/>
            <MiningField name="workclass" missingValueReplacement="4.0" missingValueTreatment="asMode"/>
            <MiningField name="marital_status" missingValueReplacement="2.0" missingValueTreatment="asMode"/>
            <MiningField name="relationship" missingValueReplacement="0.0" missingValueTreatment="asMode"/>
            <MiningField name="race" missingValueReplacement="4.0" missingValueTreatment="asMode"/>
            <MiningField name="occupation" missingValueReplacement="4.0" missingValueTreatment="asMode"/>
            <MiningField name="native_country" missingValueReplacement="38.0" missingValueTreatment="asMode"/>
        </MiningSchema>
        <Output>
            <OutputField name="probability(&lt;=50K)" optype="continuous" dataType="double" feature="probability" value="&lt;=50K"/>
            <OutputField name="probability(&gt;50K)" optype="continuous" dataType="double" feature="probability" value="&gt;50K"/>
        </Output>
        <LocalTransformations>
            <DerivedField name="standardScaler(age)" optype="continuous" dataType="double">
                <Apply function="/">
                    <Apply function="-">
                        <FieldRef field="age"/>
                        <Constant dataType="double">38.538730992755355</Constant>
                    </Apply>
                    <Constant dataType="double">13.526761732538716</Constant>
                </Apply>
            </DerivedField>
            <DerivedField name="encoder(workclass)" optype="categorical" dataType="integer">
                <MapValues outputColumn="data:output">
                    <FieldColumnPair field="workclass" column="data:input"/>
                    <InlineTable>
                        <row>
                            <data:input>?</data:input>
                            <data:output>0</data:output>
                        </row>
                        <row>
                            <data:input>Federal-gov</data:input>
                            <data:output>1</data:output>
                        </row>
                        <row>
                            <data:input>Local-gov</data:input>
                            <data:output>2</data:output>
                        </row>
                        <row>
                            <data:input>Never-worked</data:input>
                            <data:output>3</data:output>
                        </row>
                        <row>
                            <data:input>Private</data:input>
                            <data:output>4</data:output>
                        </row>
                        <row>
                            <data:input>Self-emp-inc</data:input>
                            <data:output>5</data:output>
                        </row>
                        <row>
                            <data:input>Self-emp-not-inc</data:input>
                            <data:output>6</data:output>
                        </row>
                        <row>
                            <data:input>State-gov</data:input>
                            <data:output>7</data:output>
                        </row>
                        <row>
                            <data:input>Without-pay</data:input>
                            <data:output>8</data:output>
                        </row>
                    </InlineTable>
                </MapValues>
            </DerivedField>
            <DerivedField name="encoder(marital_status)" optype="categorical" dataType="integer">
                <MapValues outputColumn="data:output">
                    <FieldColumnPair field="marital_status" column="data:input"/>
                    <InlineTable>
                        <row>
                            <data:input>Divorced</data:input>
                            <data:output>0</data:output>
                        </row>
                        <row>
                            <data:input>Married-AF-spouse</data:input>
                            <data:output>1</data:output>
                        </row>
                        <row>
                            <data:input>Married-civ-spouse</data:input>
                            <data:output>2</data:output>
                        </row>
                        <row>
                            <data:input>Married-spouse-absent</data:input>
                            <data:output>3</data:output>
                        </row>
                        <row>
                            <data:input>Never-married</data:input>
                            <data:output>4</data:output>
                        </row>
                        <row>
                            <data:input>Separated</data:input>
                            <data:output>5</data:output>
                        </row>
                        <row>
                            <data:input>Widowed</data:input>
                            <data:output>6</data:output>
                        </row>
                    </InlineTable>
                </MapValues>
            </DerivedField>
            <DerivedField name="encoder(relationship)" optype="categorical" dataType="integer">
                <MapValues outputColumn="data:output">
                    <FieldColumnPair field="relationship" column="data:input"/>
                    <InlineTable>
                        <row>
                            <data:input>Husband</data:input>
                            <data:output>0</data:output>
                        </row>
                        <row>
                            <data:input>Not-in-family</data:input>
                            <data:output>1</data:output>
                        </row>
                        <row>
                            <data:input>Other-relative</data:input>
                            <data:output>2</data:output>
                        </row>
                        <row>
                            <data:input>Own-child</data:input>
                            <data:output>3</data:output>
                        </row>
                        <row>
                            <data:input>Unmarried</data:input>
                            <data:output>4</data:output>
                        </row>
                        <row>
                            <data:input>Wife</data:input>
                            <data:output>5</data:output>
                        </row>
                    </InlineTable>
                </MapValues>
            </DerivedField>
            <DerivedField name="encoder(race)" optype="categorical" dataType="integer">
                <MapValues outputColumn="data:output">
                    <FieldColumnPair field="race" column="data:input"/>
                    <InlineTable>
                        <row>
                            <data:input>Amer-Indian-Eskimo</data:input>
                            <data:output>0</data:output>
                        </row>
                        <row>
                            <data:input>Asian-Pac-Islander</data:input>
                            <data:output>1</data:output>
                        </row>
                        <row>
                            <data:input>Black</data:input>
                            <data:output>2</data:output>
                        </row>
                        <row>
                            <data:input>Other</data:input>
                            <data:output>3</data:output>
                        </row>
                        <row>
                            <data:input>White</data:input>
                            <data:output>4</data:output>
                        </row>
                    </InlineTable>
                </MapValues>
            </DerivedField>
            <DerivedField name="encoder(occupation)" optype="categorical" dataType="integer">
                <MapValues outputColumn="data:output">
                    <FieldColumnPair field="occupation" column="data:input"/>
                    <InlineTable>
                        <row>
                            <data:input>?</data:input>
                            <data:output>0</data:output>
                        </row>
                        <row>
                            <data:input>Adm-clerical</data:input>
                            <data:output>1</data:output>
                        </row>
                        <row>
                            <data:input>Armed-Forces</data:input>
                            <data:output>2</data:output>
                        </row>
                        <row>
                            <data:input>Craft-repair</data:input>
                            <data:output>3</data:output>
                        </row>
                        <row>
                            <data:input>Exec-managerial</data:input>
                            <data:output>4</data:output>
                        </row>
                        <row>
                            <data:input>Farming-fishing</data:input>
                            <data:output>5</data:output>
                        </row>
                        <row>
                            <data:input>Handlers-cleaners</data:input>
                            <data:output>6</data:output>
                        </row>
                        <row>
                            <data:input>Machine-op-inspct</data:input>
                            <data:output>7</data:output>
                        </row>
                        <row>
                            <data:input>Other-service</data:input>
                            <data:output>8</data:output>
                        </row>
                        <row>
                            <data:input>Priv-house-serv</data:input>
                            <data:output>9</data:output>
                        </row>
                        <row>
                            <data:input>Prof-specialty</data:input>
                            <data:output>10</data:output>
                        </row>
                        <row>
                            <data:input>Protective-serv</data:input>
                            <data:output>11</data:output>
                        </row>
                        <row>
                            <data:input>Sales</data:input>
                            <data:output>12</data:output>
                        </row>
                        <row>
                            <data:input>Tech-support</data:input>
                            <data:output>13</data:output>
                        </row>
                        <row>
                            <data:input>Transport-moving</data:input>
                            <data:output>14</data:output>
                        </row>
                    </InlineTable>
                </MapValues>
            </DerivedField>
            <DerivedField name="encoder(native_country)" optype="categorical" dataType="integer">
                <MapValues outputColumn="data:output">
                    <FieldColumnPair field="native_country" column="data:input"/>
                    <InlineTable>
                        <row>
                            <data:input>?</data:input>
                            <data:output>0</data:output>
                        </row>
                        <row>
                            <data:input>Cambodia</data:input>
                            <data:output>1</data:output>
                        </row>
                        <row>
                            <data:input>Canada</data:input>
                            <data:output>2</data:output>
                        </row>
                        <row>
                            <data:input>China</data:input>
                            <data:output>3</data:output>
                        </row>
                        <row>
                            <data:input>Columbia</data:input>
                            <data:output>4</data:output>
                        </row>
                        <row>
                            <data:input>Cuba</data:input>
                            <data:output>5</data:output>
                        </row>
                        <row>
                            <data:input>Dominican-Republic</data:input>
                            <data:output>6</data:output>
                        </row>
                        <row>
                            <data:input>Ecuador</data:input>
                            <data:output>7</data:output>
                        </row>
                        <row>
                            <data:input>El-Salvador</data:input>
                            <data:output>8</data:output>
                        </row>
                        <row>
                            <data:input>England</data:input>
                            <data:output>9</data:output>
                        </row>
                        <row>
                            <data:input>France</data:input>
                            <data:output>10</data:output>
                        </row>
                        <row>
                            <data:input>Germany</data:input>
                            <data:output>11</data:output>
                        </row>
                        <row>
                            <data:input>Greece</data:input>
                            <data:output>12</data:output>
                        </row>
                        <row>
                            <data:input>Guatemala</data:input>
                            <data:output>13</data:output>
                        </row>
                        <row>
                            <data:input>Haiti</data:input>
                            <data:output>14</data:output>
                        </row>
                        <row>
                            <data:input>Honduras</data:input>
                            <data:output>15</data:output>
                        </row>
                        <row>
                            <data:input>Hong</data:input>
                            <data:output>16</data:output>
                        </row>
                        <row>
                            <data:input>Hungary</data:input>
                            <data:output>17</data:output>
                        </row>
                        <row>
                            <data:input>India</data:input>
                            <data:output>18</data:output>
                        </row>
                        <row>
                            <data:input>Iran</data:input>
                            <data:output>19</data:output>
                        </row>
                        <row>
                            <data:input>Ireland</data:input>
                            <data:output>20</data:output>
                        </row>
                        <row>
                            <data:input>Italy</data:input>
                            <data:output>21</data:output>
                        </row>
                        <row>
                            <data:input>Jamaica</data:input>
                            <data:output>22</data:output>
                        </row>
                        <row>
                            <data:input>Japan</data:input>
                            <data:output>23</data:output>
                        </row>
                        <row>
                            <data:input>Laos</data:input>
                            <data:output>24</data:output>
                        </row>
                        <row>
                            <data:input>Mexico</data:input>
                            <data:output>25</data:output>
                        </row>
                        <row>
                            <data:input>Nicaragua</data:input>
                            <data:output>26</data:output>
                        </row>
                        <row>
                            <data:input>Outlying-US(Guam-USVI-etc)</data:input>
                            <data:output>27</data:output>
                        </row>
                        <row>
                            <data:input>Peru</data:input>
                            <data:output>28</data:output>
                        </row>
                        <row>
                            <data:input>Philippines</data:input>
                            <data:output>29</data:output>
                        </row>
                        <row>
                            <data:input>Poland</data:input>
                            <data:output>30</data:output>
                        </row>
                        <row>
                            <data:input>Portugal</data:input>
                            <data:output>31</data:output>
                        </row>
                        <row>
                            <data:input>Puerto-Rico</data:input>
                            <data:output>32</data:output>
                        </row>
                        <row>
                            <data:input>Scotland</data:input>
                            <data:output>33</data:output>
                        </row>
                        <row>
                            <data:input>South</data:input>
                            <data:output>34</data:output>
                        </row>
                        <row>
                            <data:input>Taiwan</data:input>
                            <data:output>35</data:output>
                        </row>
                        <row>
                            <data:input>Thailand</data:input>
                            <data:output>36</data:output>
                        </row>
                        <row>
                            <data:input>Trinadad&amp;Tobago</data:input>
                            <data:output>37</data:output>
                        </row>
                        <row>
                            <data:input>United-States</data:input>
                            <data:output>38</data:output>
                        </row>
                        <row>
                            <data:input>Vietnam</data:input>
                            <data:output>39</data:output>
                        </row>
                        <row>
                            <data:input>Yugoslavia</data:input>
                            <data:output>40</data:output>
                        </row>
                    </InlineTable>
                </MapValues>
            </DerivedField>
            <DerivedField name="imputer(standardScaler(age))" optype="continuous" dataType="double">
                <Apply function="if">
                    <Apply function="isMissing">
                        <FieldRef field="standardScaler(age)"/>
                    </Apply>
                    <Constant dataType="double">-0.5573197149337588</Constant>
                    <FieldRef field="standardScaler(age)"/>
                </Apply>
            </DerivedField>
            <DerivedField name="continuous(encoder(workclass))" optype="continuous" dataType="integer">
                <FieldRef field="encoder(workclass)"/>
            </DerivedField>
            <DerivedField name="continuous(encoder(marital_status))" optype="continuous" dataType="integer">
                <FieldRef field="encoder(marital_status)"/>
            </DerivedField>
            <DerivedField name="continuous(encoder(relationship))" optype="continuous" dataType="integer">
                <FieldRef field="encoder(relationship)"/>
            </DerivedField>
            <DerivedField name="continuous(encoder(race))" optype="continuous" dataType="integer">
                <FieldRef field="encoder(race)"/>
            </DerivedField>
            <DerivedField name="continuous(encoder(occupation))" optype="continuous" dataType="integer">
                <FieldRef field="encoder(occupation)"/>
            </DerivedField>
            <DerivedField name="continuous(encoder(native_country))" optype="continuous" dataType="integer">
                <FieldRef field="encoder(native_country)"/>
            </DerivedField>
        </LocalTransformations>
        <RegressionTable intercept="-1.0740979028824573" targetCategory="&gt;50K">
            <NumericPredictor name="imputer(standardScaler(age))" coefficient="0.40696744781234445"/>
            <NumericPredictor name="continuous(encoder(workclass))" coefficient="0.03623305747993505"/>
            <NumericPredictor name="continuous(encoder(marital_status))" coefficient="-0.23950788576122492"/>
            <NumericPredictor name="continuous(encoder(relationship))" coefficient="-0.31194325232052256"/>
            <NumericPredictor name="continuous(encoder(race))" coefficient="0.11651425304147206"/>
            <NumericPredictor name="continuous(encoder(occupation))" coefficient="0.04048431885450443"/>
            <NumericPredictor name="continuous(encoder(native_country))" coefficient="-5.45427446319904E-4"/>
        </RegressionTable>
        <RegressionTable intercept="0.0" targetCategory="&lt;=50K"/>
    </RegressionModel>
</PMML>

Here is the dataset I have used (data_pmml4s.csv):

age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
34,?,,HS-grad,9,Never-married,?,Not-in-family,Black,Female,,,,,<=50K
27,Private,,HS-grad,9,Never-married,Adm-clerical,Own-child,Black,Male,,,,,<=50K
44,Private,,HS-grad,9,Never-married,Sales,Other-relative,White,Female,,,,,<=50K
26,Private,,HS-grad,9,Never-married,Craft-repair,Own-child,White,Male,,,,,<=50K
25,Private,,Assoc-voc,11,Never-married,Other-service,Own-child,White,Female,,,,,<=50K
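
For reference, the empty cells in the rows above are what `pandas.read_csv` turns into NaN, and so what reaches the model as nulls. A minimal sketch using only the standard library, with the header, the first two data rows, and the MiningSchema field names copied from this issue (the helper name `empty_model_inputs` is hypothetical), lists which model inputs are actually empty:

```python
import csv
import io

# Header and first two data rows of data_pmml4s.csv, copied from this issue.
CSV_TEXT = """\
age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
34,?,,HS-grad,9,Never-married,?,Not-in-family,Black,Female,,,,,<=50K
27,Private,,HS-grad,9,Never-married,Adm-clerical,Own-child,Black,Male,,,,,<=50K
"""

# Active fields from the model's MiningSchema (the target "y" is excluded).
MODEL_INPUTS = [
    "age", "workclass", "marital_status", "relationship",
    "race", "occupation", "native_country",
]


def empty_model_inputs(csv_text):
    """Return, for each data row, the model input fields whose cell is empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [[name for name in MODEL_INPUTS if row[name] == ""] for row in reader]


missing = empty_model_inputs(CSV_TEXT)
print(missing)
```

Among the fields the model uses, only `native_country` is empty. The `?` in `workclass` and `occupation` is a literal category value listed in the DataDictionary, not a null.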

I am using the following code to run:

import pandas as pd
import pypmml
from pypmml import Model

print(pypmml.__version__)
df = pd.read_csv('data_pmml4s.csv')
m = Model.load('model.pmml')
res = m.predict(df)
print(res)

It runs with 0.9.9:

$ python3 run.py
0.9.9
   probability(<=50K)  probability(>50K)
0                 NaN                NaN
1                 NaN                NaN
2                 NaN                NaN
3                 NaN                NaN
4                 NaN                NaN
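
One possible reading of the all-NaN output, sketched in plain Python rather than taken from pmml4s itself: the `missingValueReplacement` values in the MiningSchema (for example `"38.0"` for `native_country`) are encoded numbers, while the `MapValues` tables are keyed by the raw strings. The sketch assumes the replacement is applied to the raw field before `MapValues`, that a lookup with no match yields a missing value, and that the binary `logit` normalization is `1 / (1 + exp(-z))`. The function `score` and the trimmed lookup tables are illustrative only; the constants are copied from model.pmml.

```python
import math

# Lookup entries copied from the MapValues tables in model.pmml
# (only the entries needed for the second data row are included).
ENCODERS = {
    "workclass": {"Private": 4},
    "marital_status": {"Never-married": 4},
    "relationship": {"Own-child": 3},
    "race": {"Black": 2},
    "occupation": {"Adm-clerical": 1},
    "native_country": {"United-States": 38},
}
# missingValueReplacement attributes from the MiningSchema.
REPLACEMENTS = {
    "workclass": "4.0", "marital_status": "2.0", "relationship": "0.0",
    "race": "4.0", "occupation": "4.0", "native_country": "38.0",
}
# Coefficients of the RegressionTable for targetCategory ">50K".
COEFFICIENTS = {
    "workclass": 0.03623305747993505,
    "marital_status": -0.23950788576122492,
    "relationship": -0.31194325232052256,
    "race": 0.11651425304147206,
    "occupation": 0.04048431885450443,
    "native_country": -5.45427446319904E-4,
}
INTERCEPT = -1.0740979028824573
AGE_COEFFICIENT = 0.40696744781234445
AGE_MEAN, AGE_STD = 38.538730992755355, 13.526761732538716


def score(record):
    """Return probability(>50K), or NaN if any encoded input ends up missing."""
    z = INTERCEPT + AGE_COEFFICIENT * (record["age"] - AGE_MEAN) / AGE_STD
    for field, table in ENCODERS.items():
        value = record.get(field)
        if value is None:
            value = REPLACEMENTS[field]  # assumption: applied to the raw field
        code = table.get(value)          # assumption: no match means missing
        if code is None:
            return math.nan
        z += COEFFICIENTS[field] * code
    return 1.0 / (1.0 + math.exp(-z))


# Second data row of data_pmml4s.csv; native_country is empty there.
row = {
    "age": 27, "workclass": "Private", "marital_status": "Never-married",
    "occupation": "Adm-clerical", "relationship": "Own-child", "race": "Black",
    "native_country": None,
}
p_missing = score(row)
p_filled = score({**row, "native_country": "United-States"})
print(p_missing, p_filled)
```

Under these assumptions the row with an empty `native_country` scores NaN because `"38.0"` is not a key in the lookup table, while the same row with `United-States` filled in gets a finite probability. This does not model the `ArrayIndexOutOfBoundsException` seen in 0.9.10.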

But fails with 0.9.10:

$ python3 run.py
0.9.10
Traceback (most recent call last):
  File "/Users/soumyava.das/Desktop/run.py", line 8, in <module>
    res = m.predict(df)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pypmml/model.py", line 177, in predict
    result = [self.call('predict', record) for record in records]
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pypmml/model.py", line 177, in <listcomp>
    result = [self.call('predict', record) for record in records]
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pypmml/base.py", line 134, in call
    return call_java_func(getattr(self._java_model, name), *a)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pypmml/base.py", line 41, in call_java_func
    return _java2py(func(*args))
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/py4j/java_gateway.py", line 1309, in __call__
    return_value = get_return_value(
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o0.predict.
: java.lang.ArrayIndexOutOfBoundsException: 20
    at org.pmml4s.data.GenericMutableSeries.update(MutableSeries.scala:70)
    at org.pmml4s.transformations.DerivedField.write(DerivedField.scala:66)
    at org.pmml4s.transformations.DerivedField.get(DerivedField.scala:55)
    at org.pmml4s.transformations.FieldExpression$class.eval(Expression.scala:58)
    at org.pmml4s.transformations.FieldRef.eval(FieldRef.scala:35)
    at org.pmml4s.transformations.DerivedField.eval(DerivedField.scala:74)
    at org.pmml4s.transformations.DerivedField.write(DerivedField.scala:65)
    at org.pmml4s.transformations.TransformationDictionary.transform(TransformationDictionary.scala:70)
    at org.pmml4s.model.Model$$anonfun$prepare$1.apply(Model.scala:395)
    at org.pmml4s.model.Model$$anonfun$prepare$1.apply(Model.scala:395)
    at scala.Option.map(Option.scala:146)
    at org.pmml4s.model.Model.prepare(Model.scala:395)
    at org.pmml4s.model.RegressionModel.predict(RegressionModel.scala:54)
    at org.pmml4s.model.Model.predict(Model.scala:193)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

@wjtdc and I would appreciate it if this could be fixed without disturbing the thread-safety fix that you have already helped us with.

scorebot commented 3 years ago

@soumyava I can reproduce it using the attached model and data. It is definitely a defect, and I have fixed it in the next version, 0.9.11. Please try it!

soumyava commented 3 years ago

@scorebot I verified with Java that the fix takes care of this, so you can close this issue. Please update Maven Central and the pypmml packages.

scorebot commented 3 years ago

Version 0.9.11 has been pushed to Maven Central and PyPI.
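
For anyone guarding a scoring job against the affected release, a minimal sketch of a version check follows. It assumes the version string comes from `pypmml.__version__` as in the script above and has the plain `major.minor.patch` form; the helper names `parse_version` and `has_null_scoring_fix` are hypothetical. Comparing as integer tuples matters here because `"0.9.9"` sorts after `"0.9.10"` when compared as strings.

```python
def parse_version(text):
    """Turn a dotted version such as "0.9.10" into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


# First release reported in this issue to contain the fix.
FIXED_IN = parse_version("0.9.11")


def has_null_scoring_fix(installed):
    """Return True if the installed version is 0.9.11 or later."""
    return parse_version(installed) >= FIXED_IN


for candidate in ("0.9.9", "0.9.10", "0.9.11"):
    print(candidate, has_null_scoring_fix(candidate))
```

Note that 0.9.9 also reports `False` here even though it did not raise the exception, since the check only answers whether the 0.9.11 fix is present.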