autodeployai / pypmml-spark

Python PMML scoring library for PySpark as SparkML Transformer
Apache License 2.0

It couldn't output raw prediction probability. #7

Closed jtzhang17 closed 2 years ago

jtzhang17 commented 2 years ago

For binary classification, this tool adds only one column, named "prediction", to the output DataFrame. When we need the raw prediction probability, calling model = model.setPredictionCol("rawPrediction") doesn't solve the problem, which means the setPredictionCol() function doesn't work as expected. In many cases we really want all of the output fields provided by the model, just as in the example on this page (https://github.com/autodeployai/pmml4s-spark). Please help. Thanks!

scorebot commented 2 years ago

@jtzhang17 The method setPredictionCol sets the column name for the prediction output; you cannot use it to change which results are returned. For the results of PMML models, see the section "Understand the result values" at https://github.com/autodeployai/pmml4s. Would you mind sending me your model for further investigation?

scorebot commented 2 years ago

@jtzhang17 Thanks for your model. Opening it in an editor, we can see that the model defines the following output:

            <OutputField name="pmml(prediction)" optype="categorical" dataType="integer" isFinalResult="false"/>
            <OutputField name="prediction" optype="continuous" dataType="double" feature="transformedValue">
                <MapValues outputColumn="data:output" dataType="double">
                    <FieldColumnPair field="pmml(prediction)" column="data:input"/>
                    <InlineTable>
                        <row>
                            <data:input>0</data:input>
                            <data:output>0</data:output>
                        </row>
                        <row>
                            <data:input>1</data:input>
                            <data:output>1</data:output>
                        </row>
                    </InlineTable>
                </MapValues>
            </OutputField>

The first output field, pmml(prediction), is an intermediate result (isFinalResult="false"), so only the second is available. That is why only one prediction column is returned by default.
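
To check which output fields a model exposes by default, the Output element can be inspected directly. The following is a minimal sketch using only the Python standard library. It assumes a simplified copy of the fragment above (the MapValues body is omitted and the PMML 4.4 namespace is a guess), and it treats a missing isFinalResult attribute as "true", which is the PMML default. The helper names are hypothetical and are not part of pypmml-spark.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the <Output> element quoted above
# (MapValues body omitted; the namespace URI is an assumption).
PMML_OUTPUT = """
<Output xmlns="http://www.dmg.org/PMML-4_4">
  <OutputField name="pmml(prediction)" optype="categorical" dataType="integer" isFinalResult="false"/>
  <OutputField name="prediction" optype="continuous" dataType="double" feature="transformedValue"/>
</Output>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def final_output_fields(xml_text):
    """Return the names of OutputField elements not marked isFinalResult="false"."""
    root = ET.fromstring(xml_text)
    return [
        el.get("name")
        for el in root.iter()
        if local_name(el.tag) == "OutputField"
        and el.get("isFinalResult", "true").lower() != "false"
    ]


print(final_output_fields(PMML_OUTPUT))
```

For this fragment only "prediction" is reported, which matches the single column seen in the output DataFrame.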

To get other results, such as the probability, the underlying PMML4S library introduced the param supplementOutput, which outputs all possible results, but pypmml-spark did not pick it up. We will publish a new version that supports it and let you know once it's done.

jtzhang17 commented 2 years ago

@scorebot Thanks for the quick update. I also tested the commonly used test data set Iris.csv and the model single_iris_dectree.xml with the pypmml-spark package, but the result contains 6 extra columns (predicted_class, probability, and more). This seems odd to me too.

scorebot commented 2 years ago

That's expected, because the model single_iris_dectree.xml does not contain the Output element, which is optional. If no output fields are defined, we try to generate all possible results. For more details, see the section "Understand the result values" at https://github.com/autodeployai/pmml4s.

jtzhang17 commented 2 years ago

@scorebot Thanks. So a hotfix might be to simply remove the output definition from my PMML model file while you work on the long-term solution?
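
The hotfix described here could be sketched as follows: parse the PMML text, drop every Output element, and serialize the result. This is an illustration using only the standard library, not a tested recipe for any particular model. The function name strip_output and the sample document are hypothetical, and the sample is deliberately minimal rather than a valid, complete PMML file.

```python
import xml.etree.ElementTree as ET


def strip_output(pmml_text):
    """Return the PMML text with every <Output> element removed."""
    root = ET.fromstring(pmml_text)
    # Keep the document's own namespace as the default one so that tags
    # are not serialized with generated prefixes such as ns0:.
    if root.tag.startswith("{"):
        ET.register_namespace("", root.tag[1:].split("}", 1)[0])
    # Snapshot the elements first so removal does not disturb the traversal.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag.rsplit("}", 1)[-1] == "Output":
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


# Minimal illustrative document (not a complete, valid PMML model).
SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <TreeModel functionName="classification">
    <Output>
      <OutputField name="prediction" dataType="double"/>
    </Output>
    <Node score="1"/>
  </TreeModel>
</PMML>"""

cleaned = strip_output(SAMPLE)
print(cleaned)
```

Note that the sketch drops the XML declaration and any formatting details of the original file, so it is worth keeping a backup of the model before overwriting it.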

scorebot commented 2 years ago

@jtzhang17 Please try installing the latest version, 0.9.14, then call model.setSupplementOutput(True) to output all possible results.

jtzhang17 commented 2 years ago

@scorebot I just tried the new version, and it seems to work as expected, outputting all probability columns. I have one minor request to make it perfect: currently, for binary classification (labels 0 and 1), it names the probability columns probability(0) and probability(1), but these are not legal column names because of the (0) or (1) suffix. Could you please modify the code slightly so that the columns are named probability_0, probability_1 (and probability_2, ... for multi-class problems)?

In addition, the predicted_{target} column is defined as LongType, but I ran into an error like

java.lang.RuntimeException: java.lang.Long is not a valid external type for schema of double

So it would be better to keep it as DoubleType.

I think these two minor flaws might be worth a new update.

scorebot commented 2 years ago

The data type error is a bug in the PMML4S library, and I'm fixing it. The names probability(0) and probability(1) come from the PMML model, for example:

    <Output>
        <OutputField name="probability(0)" optype="continuous" dataType="double" feature="probability" value="0"/>
        <OutputField name="probability(1)" optype="continuous" dataType="double" feature="probability" value="1"/>
    </Output>

Also, I don't think they are illegal column names; the following code shows them correctly:

>>> dfout = model.setSupplementOutput(True).setPrependInputs(False).transform(df)
>>> dfout.show()
22/03/12 13:56:32 WARN DAGScheduler: Broadcasting large task binary with size 9.3 MiB
+----------+--------------------+------------------+---------------+------------------+
|prediction|      probability(0)|    probability(1)|predicted_label|       probability|
+----------+--------------------+------------------+---------------+------------------+
|       1.0|7.489532779244579E-4|0.9992510467220755|              1|0.9992510467220755|
+----------+--------------------+------------------+---------------+------------------+

>>> one = dfout.select("probability(0)")
>>> one.show()
22/03/12 13:57:19 WARN DAGScheduler: Broadcasting large task binary with size 9.3 MiB
+--------------------+
|      probability(0)|
+--------------------+
|7.489532779244579E-4|
+--------------------+

If you want the default names like probability_1, you can remove the probability output fields from the Output element of the PMML.

scorebot commented 2 years ago

@jtzhang17 The first issue has been fixed in the latest version. Please try installing it from GitHub, for example:

pip install --upgrade git+https://github.com/autodeployai/pypmml-spark.git

jtzhang17 commented 2 years ago

Yeah, I think the column name issue is trivial and can be solved with some simple renaming. If the LongType issue is fixed, could you please bump the package version? I will test it soon. Thanks!
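
The renaming mentioned here could look roughly like the sketch below. The helper names are hypothetical and not part of pypmml-spark. The name-cleaning function is plain Python; rename_columns assumes it is given a PySpark DataFrame and relies only on df.columns and df.toDF(*names).

```python
import re


def safe_column_name(name):
    """Map e.g. 'probability(0)' to 'probability_0'; plain names are unchanged."""
    # Replace each run of non-word characters with '_' and trim
    # leading/trailing underscores left over from the replacement.
    return re.sub(r"\W+", "_", name).strip("_")


def rename_columns(df):
    """Return df with sanitized column names.

    Assumes df is a PySpark DataFrame; only df.columns and
    df.toDF(*names) are used.
    """
    return df.toDF(*[safe_column_name(c) for c in df.columns])


# Column names taken from the dfout.show() output earlier in this thread.
columns = ["prediction", "probability(0)", "probability(1)", "predicted_label", "probability"]
print([safe_column_name(c) for c in columns])
```

With a scored DataFrame, the call would be something like rename_columns(model.transform(df)). Two different source names could in principle map to the same cleaned name, so it is worth checking the result for duplicates.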

scorebot commented 2 years ago

The latest version, 0.9.15, has been pushed to PyPI. Please try installing it.

jtzhang17 commented 2 years ago

The latest version works fine and as expected. I think this issue can be closed.