autodeployai / pypmml

Python PMML scoring library
Apache License 2.0

MissingValueReplacement not working as expected #49

Closed wf-r closed 1 year ago

wf-r commented 2 years ago

Hi,

Sorry to bother you (again :)):

I think missingValueReplacement of the mining schema is not working as expected (see https://dmg.org/pmml/v4-3/MiningSchema.html):

If this attribute is specified then a missing input value is automatically replaced by the given value. That is, the model itself works as if the given value was found in the original input.

Example (on Python 3.8.6, PyPMML 0.9.17):

import pypmml
import pandas as pd
import numpy as np

df = pd.DataFrame([["test"], ["MISSING"], [np.NaN], ["NA"], ["X"]], columns=["TEST"])
model_1 = pypmml.Model.fromString("""<PMML xmlns="https://www.dmg.org/PMML-4_3" version="4.3">
    <Header copyright="dmg.org"/>
    <DataDictionary>
        <DataField name="TEST" optype="categorical" dataType="string">
            <Value property="valid" value="MISSING"/>
            <Value property="valid" value="test"/>
        </DataField>
        <DataField name="SCORE" optype="continuous" dataType="double"/>
    </DataDictionary>
    <RegressionModel functionName="regression" modelName="" normalizationMethod="softmax">
        <MiningSchema>
            <MiningField invalidValueTreatment="returnInvalid" missingValueReplacement="MISSING" name="TEST" usageType="active"/>
            <MiningField name="SCORE" usageType="target"/>
        </MiningSchema>
        <RegressionTable targetCategory="1" intercept="0.5">
            <CategoricalPredictor name="TEST" value="MISSING" coefficient="0.3"/>
            <CategoricalPredictor name="TEST" value="test" coefficient="-0.2"/>
        </RegressionTable>
        <RegressionTable targetCategory="0" intercept="0"/>
    </RegressionModel>
</PMML>""")
model_2 = pypmml.Model.fromString("""<PMML xmlns="https://www.dmg.org/PMML-4_3" version="4.3">
    <Header copyright="dmg.org"/>
    <DataDictionary>
        <DataField name="TEST" optype="categorical" dataType="string">
            <Value property="valid" value="MISSING"/>
            <Value property="valid" value="test"/>
        </DataField>
        <DataField name="SCORE" optype="continuous" dataType="double"/>
    </DataDictionary>
    <RegressionModel functionName="regression" modelName="" normalizationMethod="softmax">
        <MiningSchema>
            <MiningField invalidValueTreatment="asMissing" missingValueReplacement="MISSING" name="TEST" usageType="active"/>
            <MiningField name="SCORE" usageType="target"/>
        </MiningSchema>
        <RegressionTable targetCategory="1" intercept="0.5">
            <CategoricalPredictor name="TEST" value="MISSING" coefficient="0.3"/>
            <CategoricalPredictor name="TEST" value="test" coefficient="-0.2"/>
        </RegressionTable>
        <RegressionTable targetCategory="0" intercept="0"/>
    </RegressionModel>
</PMML>""")
print(model_1.predict(df))  # Missing values are not replaced; an invalid result is returned instead.
print(model_2.predict(df))  # Invalid values should be replaced by "MISSING"; however, the results indicate that no predictor is used in the regression.

Best,
Wolfgang

scorebot commented 2 years ago

@wf-r The three values [np.NaN], ["NA"], and ["X"] are treated as invalid values, not missing ones. For model_1, the NaN result is therefore expected, because invalidValueTreatment="returnInvalid". To test a missing value, pass None as the input.
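
To make the missing/invalid distinction concrete, here is a rough pure-Python sketch of how a MiningField could treat one input value. It is a simplified illustration under stated assumptions, not the code pypmml or pmml4s actually runs: the function name `treat_value` and the `INVALID` sentinel are hypothetical, only None counts as missing, and any value outside the declared valid values counts as invalid.

```python
# Hypothetical sentinel standing in for "the model returns an invalid result".
INVALID = object()

# Valid values declared for the TEST field in the DataDictionary.
VALID_VALUES = {"MISSING", "test"}


def treat_value(value, valid_values, invalid_value_treatment, missing_value_replacement=None):
    """Simplified sketch of MiningField value treatment (assumption, not pypmml's code)."""
    # Anything that is present but not a declared valid value is invalid.
    if value is not None and value not in valid_values:
        if invalid_value_treatment == "returnInvalid":
            return INVALID
        if invalid_value_treatment == "asIs":
            return value
        if invalid_value_treatment == "asMissing":
            value = None  # fall through to missing-value handling
        else:
            raise ValueError(f"unsupported treatment: {invalid_value_treatment}")
    # Missing values are replaced, if a replacement is configured.
    if value is None:
        return missing_value_replacement
    return value


inputs = ["test", "MISSING", float("nan"), "NA", "X", None]
like_model_1 = [treat_value(v, VALID_VALUES, "returnInvalid", "MISSING") for v in inputs]
like_model_2 = [treat_value(v, VALID_VALUES, "asMissing", "MISSING") for v in inputs]
```

In this sketch, only the final None is replaced by "MISSING" under returnInvalid, while asMissing maps every invalid input to "MISSING" as well.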

For model_2, there was a bug in the internal pmml4s library, which we have now fixed. Please reinstall the latest pypmml from git with pip install --upgrade git+https://github.com/autodeployai/pypmml.git. Because invalidValueTreatment="asMissing", the results for those invalid values should be the same as for the input ["MISSING"], for example:

>>> model_1.predict(df)
   predicted_SCORE
0         0.574443
1         0.689974
2              NaN
3              NaN
4              NaN
>>> model_2.predict(df)
   predicted_SCORE
0         0.574443
1         0.689974
2         0.689974
3         0.689974
4         0.689974
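
As a sanity check on these numbers, the two distinct scores are consistent with applying the logistic function to the intercept plus the matching coefficient of the targetCategory="1" regression table. The sketch below reproduces them by hand; treating the softmax normalization as a plain logistic transform here is an assumption that happens to match the output, and the helper name `score` is only illustrative.

```python
import math

# Values taken from the RegressionTable with targetCategory="1".
INTERCEPT = 0.5
COEFFICIENTS = {"MISSING": 0.3, "test": -0.2}


def score(category):
    """Logistic transform of intercept + coefficient (assumed normalization)."""
    y = INTERCEPT + COEFFICIENTS[category]
    return 1.0 / (1.0 + math.exp(-y))


print(round(score("test"), 6))     # 0.574443
print(round(score("MISSING"), 6))  # 0.689974
```

Rows 2 to 4 of model_2 match the "MISSING" score because those inputs are treated as missing and then replaced by "MISSING".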

scorebot commented 2 years ago

@wf-r Please let me know if you still have a problem.

wf-r commented 1 year ago

Thanks, the issue is fixed (and thanks for explaining that None is what counts as the missing value).