autodeployai / pypmml

Python PMML scoring library
Apache License 2.0

Categorical data field with all values valid #48

Closed: wf-r closed this issue 1 year ago

wf-r commented 2 years ago

Hi,

According to Link, if a categorical field contains at least one Value element whose property is "valid", those values completely define the set of valid values. Otherwise, any value should be valid by default.

Is the second part respected in PyPmml?
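For reference, the rule quoted above can be sketched in plain Python. This is only a minimal illustration of how I read the specification, not PyPMML's implementation; `classify_value` is a hypothetical helper, and it assumes that a Value element without a `property` attribute counts as "valid".

```python
import xml.etree.ElementTree as ET


def classify_value(data_field_xml, value):
    """Classify `value` as "valid", "invalid" or "missing" for one DataField.

    Hypothetical helper that illustrates the rule only; not part of PyPMML.
    """
    field = ET.fromstring(data_field_xml)
    declared = {"valid": set(), "invalid": set(), "missing": set()}
    for elem in field.iter():
        # Strip a possible "{namespace}" prefix from the tag name.
        if elem.tag.rsplit("}", 1)[-1] == "Value":
            # Assumption: an omitted `property` attribute means "valid".
            prop = elem.get("property", "valid")
            declared.setdefault(prop, set()).add(elem.get("value"))

    if value in declared["missing"]:
        return "missing"
    if value in declared["invalid"]:
        return "invalid"
    if declared["valid"]:
        # At least one valid value is listed, so the list is exhaustive.
        return "valid" if value in declared["valid"] else "invalid"
    # No valid values are listed, so any other value is valid by default.
    return "valid"


open_field = '<DataField name="TEST" optype="categorical" dataType="string"/>'
closed_field = (
    '<DataField name="TEST" optype="categorical" dataType="string">'
    '<Value value="val1"/><Value value="val2" property="valid"/>'
    "</DataField>"
)

print(classify_value(open_field, "val3"))
print(classify_value(closed_field, "val3"))
```

With the DataField from this report (no Value children), the sketch treats "val3" as valid, which is the behaviour I would expect from the model below.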

If I test the following model in PyPmml 0.9.16 on Python 3.8.6, "val1" and "val2" appear to be considered valid, but "val3" is not.

import pypmml
import pandas as pd
from pandas.testing import assert_frame_equal
import numpy as np

model = pypmml.Model.fromString("""<PMML xmlns="https://www.dmg.org/PMML-4_3" version="4.3">
    <Header copyright="dmg.org"/>
    <DataDictionary>
        <DataField name="TEST" optype="categorical" dataType="string">
        </DataField>
        <DataField name="SCORE" optype="continuous" dataType="double"/>
    </DataDictionary>
    <TreeModel modelName="test" functionName="classification" missingValueStrategy="none">
        <MiningSchema>
            <MiningField name="TEST" usageType="active"/>
            <MiningField name="SCORE" usageType="target"/>
        </MiningSchema>
        <Node>
            <True/>
            <Node score="1.0">
                <SimplePredicate field="TEST" operator="equal" value="val1"/>
            </Node>
            <Node score="2.0">
                <SimplePredicate field="TEST" operator="equal" value="val2"/>
            </Node>
            <Node score="3.0">
                <True/>
            </Node>
        </Node>
    </TreeModel>
</PMML>""")
df = pd.DataFrame([["val1"], ["val2"], ["val3"]], columns=["TEST"])
expected = pd.DataFrame([[1.0], [2.0], [3.0]], columns=["predicted_SCORE"])
# Fails on 0.9.16: predict() actually returns 1.0, 2.0, np.NaN
assert_frame_equal(model.predict(df), expected)

Note that adding invalidValueTreatment="asIs" to the MiningField works around this.
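The workaround can also be applied programmatically instead of editing the PMML by hand. The following is a minimal sketch using only the standard library; `set_invalid_value_treatment` is a hypothetical helper name, and the shortened PMML string stands in for the full model above.

```python
import xml.etree.ElementTree as ET


def set_invalid_value_treatment(pmml_text, field_name, treatment="asIs"):
    """Return PMML text with invalidValueTreatment set on one MiningField.

    Hypothetical helper for illustration; not part of PyPMML.
    """
    root = ET.fromstring(pmml_text)
    if root.tag.startswith("{"):
        # Keep the default namespace on output instead of an "ns0:" prefix.
        ET.register_namespace("", root.tag[1:].split("}", 1)[0])
    for elem in root.iter():
        is_mining_field = elem.tag.rsplit("}", 1)[-1] == "MiningField"
        if is_mining_field and elem.get("name") == field_name:
            elem.set("invalidValueTreatment", treatment)
    return ET.tostring(root, encoding="unicode")


# Shortened stand-in for the model in this report (MiningSchema only).
schema_only = (
    '<PMML xmlns="https://www.dmg.org/PMML-4_3" version="4.3">'
    "<MiningSchema>"
    '<MiningField name="TEST" usageType="active"/>'
    '<MiningField name="SCORE" usageType="target"/>'
    "</MiningSchema>"
    "</PMML>"
)

patched = set_invalid_value_treatment(schema_only, "TEST")
print(patched)
```

Applied to the full model text, the patched string can then be loaded with pypmml.Model.fromString exactly as in the example above.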

Best,
Wolfgang

scorebot commented 2 years ago

@wf-r Thanks for your findings. You are right: val3 should be treated as a valid value. This is a defect, and we will fix it as soon as possible.

scorebot commented 2 years ago

The issue has been fixed. Please reinstall the latest version from GitHub with the following command:

pip install --upgrade git+https://github.com/autodeployai/pypmml.git

Please let me know if you still have a problem.

scorebot commented 2 years ago

@wf-r Can we close the issue? Please let me know if you have any further problems.

wf-r commented 1 year ago

Thanks for the quick fix. The issue is resolved!