Closed jtzhang17 closed 2 years ago
@jtzhang17 The method `setPredictionCol` sets the name of the output column for the prediction; it cannot be used to change which results are returned. For details on the results of PMML models, see the section "Understand the result values" in https://github.com/autodeployai/pmml4s. Would you mind sending me your model for further investigation?
@jtzhang17 Thanks for your model. Opening the model in an editor, we can see that it defines the following output:
<OutputField name="pmml(prediction)" optype="categorical" dataType="integer" isFinalResult="false"/>
<OutputField name="prediction" optype="continuous" dataType="double" feature="transformedValue">
<MapValues outputColumn="data:output" dataType="double">
<FieldColumnPair field="pmml(prediction)" column="data:input"/>
<InlineTable>
<row>
<data:input>0</data:input>
<data:output>0</data:output>
</row>
<row>
<data:input>1</data:input>
<data:output>1</data:output>
</row>
</InlineTable>
</MapValues>
</OutputField>
The first output field, `pmml(prediction)`, is an intermediate result (`isFinalResult="false"`), so only the second field is available. That is why only one prediction column is returned by default.
To get other results, such as the probability, the underlying PMML4S library introduced the param `supplementOutput` to output all possible results, but `pypmml-spark` has not picked it up yet. We will publish a new version that supports it and let you know once it's done.
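To check which output fields a model exposes before scoring, one option is to inspect the PMML file directly. The following is only a minimal sketch using the Python standard library: the helper names `local_name` and `list_output_fields` are hypothetical, and `SAMPLE` is a stripped-down stand-in for a real model (it keeps just the two `OutputField` declarations and is not a complete PMML document). It assumes the PMML default that `isFinalResult` is true when the attribute is absent.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def list_output_fields(pmml_text):
    """Return (name, is_final) for every OutputField in the PMML text."""
    root = ET.fromstring(pmml_text)
    fields = []
    for elem in root.iter():
        if local_name(elem.tag) == "OutputField":
            # A missing isFinalResult attribute is treated as "true".
            is_final = elem.get("isFinalResult", "true") != "false"
            fields.append((elem.get("name"), is_final))
    return fields


# Hypothetical, simplified stand-in for the Output section of a model.
SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Output>
    <OutputField name="pmml(prediction)" optype="categorical" dataType="integer" isFinalResult="false"/>
    <OutputField name="prediction" optype="continuous" dataType="double" feature="transformedValue"/>
  </Output>
</PMML>"""

fields = list_output_fields(SAMPLE)
final_names = [name for name, is_final in fields if is_final]
print(final_names)
```

For the sample above, only `prediction` is reported as a final result, which matches the single prediction column returned by default.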
@scorebot Thanks for the quick update. I also tested the commonly used Iris data set `Iris.csv` and the model `single_iris_dectree.xml` with the `pypmml-spark` package, but the results contain 6 extra columns (`predicted_class`, `probability`, and more). This seems odd to me too.
That is expected, because the model `single_iris_dectree.xml` does not contain the `Output` element, which is optional. If no output fields are defined, we try to generate all possible results. For more details, see the section "Understand the result values" in https://github.com/autodeployai/pmml4s.
@scorebot Thanks. So I think a hotfix might be to simply remove the `Output` definition from my PMML model file while you work on the long-term solution?
@jtzhang17 Please try installing the latest version, 0.9.14, then call `model.setSupplementOutput(True)` to output all possible results.
@scorebot I just tried the new version, and it seems to work as expected, outputting all probability columns. I have one minor request to make it perfect: currently, for binary classification (labels `0`, `1`), it outputs `probability(0)`, `probability(1)` as the probability column names, but these are not legal column names because of the suffix `(0)` or `(1)`. Could you please modify the code so that the columns are named `probability_0`, `probability_1` (and `probability_2`, ... for multi-class problems)?
In addition, the `predicted_{target}` column is defined as `LongType`, but I ran into an error like
java.lang.RuntimeException: java.lang.Long is not a valid external type for schema of double
So it would be better to keep it as `DoubleType`.
I think these two minor flaws might be worth a new update.
The data type error is a bug in the PMML4S library; I'm fixing it. As for the names `probability(0)` and `probability(1)`, they come from the PMML model, for example:
<Output>
<OutputField name="probability(0)" optype="continuous" dataType="double" feature="probability" value="0"/>
<OutputField name="probability(1)" optype="continuous" dataType="double" feature="probability" value="1"/>
</Output>
Also, I don't think they are illegal column names; the following code shows them correctly:
>>> dfout = model.setSupplementOutput(True).setPrependInputs(False).transform(df)
>>> dfout.show()
22/03/12 13:56:32 WARN DAGScheduler: Broadcasting large task binary with size 9.3 MiB
+----------+--------------------+------------------+---------------+------------------+
|prediction|      probability(0)|    probability(1)|predicted_label|       probability|
+----------+--------------------+------------------+---------------+------------------+
|       1.0|7.489532779244579E-4|0.9992510467220755|              1|0.9992510467220755|
+----------+--------------------+------------------+---------------+------------------+
>>> one = dfout.select("probability(0)")
>>> one.show()
22/03/12 13:57:19 WARN DAGScheduler: Broadcasting large task binary with size 9.3 MiB
+--------------------+
|      probability(0)|
+--------------------+
|7.489532779244579E-4|
+--------------------+
If you want the default names like `probability_1`, you can remove the probability output fields from the `Output` element of the PMML.
@jtzhang17 The first issue has been fixed in the latest version; please try installing it from GitHub, for example:
pip install --upgrade git+https://github.com/autodeployai/pypmml-spark.git
Yeah, I think the column name issue is trivial and can be solved with some simple renaming. If the `LongType` issue is fixed, could you please bump the version of this package? I will test it soon. Thanks!
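The renaming mentioned above could look roughly like the sketch below. It is not part of `pypmml-spark`: `sanitize_column_name` and `rename_columns` are hypothetical helpers, and the rule used here (replace every run of characters other than letters, digits, and underscores with a single underscore, then trim underscores at both ends) is an assumption about what counts as an acceptable name. `rename_columns` relies only on the PySpark `DataFrame.columns` attribute and `DataFrame.toDF(*cols)` method.

```python
import re


def sanitize_column_name(name):
    """Turn a name such as 'probability(0)' into 'probability_0'."""
    # Replace each run of characters outside [0-9A-Za-z_] with one underscore,
    # then trim the underscores left at either end (e.g. from a trailing ')').
    cleaned = re.sub(r"[^0-9A-Za-z_]+", "_", name)
    return cleaned.strip("_")


def rename_columns(df):
    """Return a DataFrame whose columns all have sanitized names.

    `df` is expected to be a PySpark DataFrame. Note that two different
    original names could map to the same sanitized name; this sketch does
    not check for such collisions.
    """
    return df.toDF(*[sanitize_column_name(c) for c in df.columns])


names = ["prediction", "probability(0)", "probability(1)", "predicted_label"]
print([sanitize_column_name(n) for n in names])
```

With a scored DataFrame such as `dfout` from the earlier comment, the call would be `rename_columns(dfout)`.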
The latest version, 0.9.15, has been pushed to PyPI; please try installing it.
The latest version works fine and as expected. I think this issue can be closed.
For binary classification, this tool adds only one column, named "prediction", to the output dataframe. When we need the raw prediction probability, calling
model = model.setPredictionCol("rawPrediction")
doesn't solve the problem, which suggests that the `setPredictionCol()` function doesn't work as expected. In many cases, we want to output all of the output fields provided by the model, as in the example on this page (https://github.com/autodeployai/pmml4s-spark). Please help. Thanks!