autodeployai / pypmml-spark

Python PMML scoring library for PySpark as SparkML Transformer
Apache License 2.0

Output mismatch between jpmml and pypmml #4

Closed DiegoSong closed 4 years ago

DiegoSong commented 4 years ago

I built a PMML model using PMMLPipeline = DataFrameMapper + LogisticRegression. The DataFrameMapper applies a transformer list like this to each column:

[ContinuousDomain(invalid_value_treatment='as_missing', missing_value_replacement=np.nan),
 CutTransformer(bins=c_score_bins, right=False, labels=c_score_values),
 SimpleImputer(strategy='constant', fill_value=0.0)]

The pypmml output seems incorrect compared with jpmml and pipeline.predict_proba.

scorebot commented 4 years ago

@DiegoSong Thanks for the info. Could you attach your model here? Then we can reproduce the issue easily.

DiegoSong commented 4 years ago

@scorebot Thanks for your reply. I sent the PMML file to autodeploy.ai@gmail.com.

scorebot commented 4 years ago

@DiegoSong Thanks, I have received the PMML model. I tested it in my environment with the latest pypmml 0.9.6 installed:

>>> from pypmml import Model
>>> model = Model.load('model.pmml')
>>> model.predict({'cProb': 0.06, 'pProb': 0.3, 'oProb': 0.5, 'T3Prob': 0.1, 'cEmrgProb': 0.06, 'pEmrgProb': 0.3, 'oEmrgProb': 0.5})
{'probability(1)': 0.18478507571419323, 'probability(0)': 0.8152149242858068}

The input record contains all valid values for those input fields. I did not find anything wrong. What about the results from jpmml and pipeline.predict_proba?

DiegoSong commented 4 years ago

https://github.com/DiegoSong/CodeBox/blob/master/jpmml_pypmml.csv The data shows the differences between jpmml and pypmml. The outputs are equal when no feature is NaN.
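
One way to locate the differing rows in such a file is to compare the two score columns with a tolerance and note whether any input feature was NaN. A minimal sketch follows; the column names (jpmml_p1, pypmml_p1) and the inline sample rows are hypothetical and do not reflect the actual layout or contents of jpmml_pypmml.csv.

```python
import csv
import io
import math

# Hypothetical sample; the real jpmml_pypmml.csv may use different columns.
SAMPLE = """oProb,oEmrgProb,jpmml_p1,pypmml_p1
0.5,0.5,0.1848,0.1848
nan,nan,0.2100,0.1500
0.3,nan,0.1900,0.1700
"""

def find_mismatches(text, col_a, col_b, feature_cols, tol=1e-6):
    """Return (row_number, has_nan_feature) for rows where the two scores differ."""
    mismatches = []
    for i, row in enumerate(csv.DictReader(io.StringIO(text)), start=1):
        a, b = float(row[col_a]), float(row[col_b])
        if not math.isclose(a, b, abs_tol=tol):
            has_nan = any(math.isnan(float(row[c])) for c in feature_cols)
            mismatches.append((i, has_nan))
    return mismatches

print(find_mismatches(SAMPLE, "jpmml_p1", "pypmml_p1", ["oProb", "oEmrgProb"]))
```

If every reported row has a NaN feature, that supports the observation that the two evaluators only disagree on NaN inputs.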

scorebot commented 4 years ago

I debugged both pmml4s (used by pypmml) and jpmml. The differences are caused by their different missing value handling policies. Take the first line of jpmml_pypmml.csv as an example: the values of both fields oProb and oEmrgProb are missing, and the PMML specifies missingValueReplacement="NaN", so both fields take the value NaN. Next, the values of the derived fields cut(oProb) and cut(oEmrgProb) have to be computed:

  1. jpmml always returns the last interval for NaN: cut(oProb)=0.444133, cut(oEmrgProb)=0.480491
  2. pypmml always returns a missing value for NaN because no defaultValue is defined: cut(oProb)=missing, cut(oEmrgProb)=missing

When evaluating imputer(cut(oProb)) and imputer(cut(oEmrgProb)), jpmml and pypmml return different values:

  1. jpmml: imputer(cut(oProb))=0.444133, imputer(cut(oEmrgProb))=0.480491
  2. pypmml: imputer(cut(oProb))=0.06101, imputer(cut(oEmrgProb))=-8.09E-4 (the imputers are defined in the PMML)

That's the reason why the final results are different. I think pypmml is more reasonable: the double value NaN (not a number) should be treated as a missing value because no interval contains it.
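
The two policies described above can be sketched in plain Python. This is an illustration only, not the code of either library: the bin edges, labels, and imputer fill value are made-up placeholders, and cut_last_interval and cut_missing are hypothetical helper names.

```python
import bisect
import math

# Placeholder bins and labels (left-closed intervals, as with right=False).
BINS = [0.0, 0.2, 0.5, 1.0]   # [0.0, 0.2), [0.2, 0.5), [0.5, 1.0)
LABELS = [-0.3, 0.1, 0.4]     # one label per interval
FILL_VALUE = 0.0              # constant the imputer uses for missing input

def cut(x):
    """Map x to the label of the left-closed interval containing it, else None."""
    if math.isnan(x) or x < BINS[0] or x >= BINS[-1]:
        return None
    return LABELS[bisect.bisect_right(BINS, x) - 1]

def cut_last_interval(x):
    """Policy reported for jpmml in this thread: NaN lands in the last interval."""
    return LABELS[-1] if math.isnan(x) else cut(x)

def cut_missing(x):
    """Policy reported for pypmml: NaN matches no interval, so the result is missing."""
    return cut(x)

def impute(value):
    """Replace a missing (None) value with the constant fill value."""
    return FILL_VALUE if value is None else value

nan = float("nan")
print(impute(cut_last_interval(nan)))  # last label; the imputer is never triggered
print(impute(cut_missing(nan)))        # the imputer's fill value
print(impute(cut_last_interval(0.3)) == impute(cut_missing(0.3)))  # valid input agrees
```

Under the first policy the imputer never sees a missing value, so its fill value is ignored; under the second it is the imputer that decides the final value. Both policies agree on every non-NaN input.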

BTW, what are the results of the native Python model? If they are the same as jpmml's, the generated PMML model likely has a problem, and you should file a bug against JPMML-SkLearn.

vruusmann commented 4 years ago

@DiegoSong Don't ever specify Domain.missing_value_replacement = NaN. Use some meaningful constant.

> I think pypmml is more reasonable: the double value NaN (not a number) should be treated as a missing value because no interval contains it.

NaN is an invalid value, not a missing value.

See http://dmg.org/pmml/v4-4/Transformations.html

> notANumber. This is the value returned from such meaningless expressions as the logarithm of a negative number. It should not be used as a generic missing value, as missing values are properly those which are unknown, indeterminate, or non-applicable. All arithmetic expressions involving NaN will evaluate to NaN.

We may agree that a Discretize element should return NaN when given NaN as input, but returning a Discretize@defaultValue is definitely not correct.
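
A short sketch of the behaviour suggested here, where a discretizer passes NaN through instead of mapping it to an interval or to the default value. The function name and the intervals are hypothetical; the only fact relied on is IEEE 754 semantics, under which every ordered comparison with NaN is false.

```python
import math

nan = float("nan")

# Every ordered comparison with NaN is false, so no interval test can succeed.
print(nan >= 0.0, nan < 1.0, nan == nan)  # False False False

def discretize_propagating(x, intervals, default=None):
    """Return the label of the first matching [low, high) interval.

    NaN input yields NaN; `default` is used only for non-NaN values
    that fall outside all intervals.
    """
    if math.isnan(x):
        return nan
    for low, high, label in intervals:
        if low <= x < high:
            return label
    return default

intervals = [(0.0, 0.5, "low"), (0.5, 1.0, "high")]
print(discretize_propagating(0.7, intervals))
print(math.isnan(discretize_propagating(nan, intervals, default="other")))
```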

scorebot commented 4 years ago

@vruusmann Thanks for your comments. Yes, NaN (not a number) is an invalid value, not a missing value. Suppose Discretize returns NaN when the input is NaN, what's the final result? NaN or a normal result based on the imputers?

vruusmann commented 4 years ago

> Suppose Discretize returns NaN when the input is NaN, what's the final result?

It seems to me that the NaN should be propagated between expression and model elements till the end - resulting in a NaN prediction. For example, when one numeric predictor is NaN (whether supplied directly by the end user, or computed during data pre-processing), then RegressionModel should also predict NaN.
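
The propagation argument can be illustrated with a toy logistic regression. The intercept and coefficients below are arbitrary placeholders, not values taken from the model in this issue.

```python
import math

# Arbitrary placeholder parameters.
INTERCEPT = -1.5
COEFFICIENTS = {"cProb": 0.8, "oProb": 1.2}

def predict_probability(record):
    """Logistic regression score; any NaN predictor makes the result NaN."""
    z = INTERCEPT + sum(COEFFICIENTS[name] * record[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-z))

print(predict_probability({"cProb": 0.06, "oProb": 0.5}))
print(predict_probability({"cProb": 0.06, "oProb": float("nan")}))  # nan
```

No explicit NaN check is needed: the weighted sum, the exponential, and the division all carry the NaN through to the prediction.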

This issue manifests itself with NaN right now. But it should apply to all missing values. For example, consider the following field definition:

<DataField name="x" dataType="double">
  <Value value="0" property="invalid"/>
</DataField>

When the end user supplies x = 0, then it is classified as a missing value, and should trigger the same chain of handlers as x = NaN would.

The handling of invalid values is almost completely unspecified by DMG.org. They should clarify.
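
To make the chain of handlers concrete, here is a rough sketch of how an evaluator might classify an input against a field's declared invalid values and then apply an invalid value treatment. The treatment names follow the PMML invalidValueTreatment attribute values (returnInvalid, asIs, asMissing); the function names, the MISSING sentinel, and the choice to classify NaN as invalid are assumptions made for illustration, since the specification leaves much of this open.

```python
import math

MISSING = None  # sentinel for a missing value in this sketch

def classify(value, invalid_values):
    """Classify a raw input as 'missing', 'invalid', or 'valid'."""
    if value is MISSING:
        return "missing"
    if isinstance(value, float) and math.isnan(value):
        return "invalid"  # assumption: NaN counts as invalid, as argued in this thread
    if value in invalid_values:
        return "invalid"
    return "valid"

def prepare(value, invalid_values, treatment="returnInvalid", replacement=MISSING):
    """Apply the invalid value treatment, then the missing value replacement."""
    kind = classify(value, invalid_values)
    if kind == "invalid":
        if treatment == "returnInvalid":
            raise ValueError(f"invalid input value: {value!r}")
        if treatment == "asMissing":
            kind, value = "missing", MISSING
        # "asIs": keep the value unchanged
    if kind == "missing":
        return replacement
    return value

# Mirrors <Value value="0" property="invalid"/> on field x.
invalid_x = {0.0}
print(prepare(0.0, invalid_x, treatment="asMissing", replacement=-1.0))
print(prepare(float("nan"), invalid_x, treatment="asMissing", replacement=-1.0))
print(prepare(2.5, invalid_x, treatment="asMissing", replacement=-1.0))
```

With asMissing, both x = 0 and x = NaN end up on the same path as a value that was never supplied, which is the equivalence the comment above asks for.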

DiegoSong commented 4 years ago

Thank you for your reply! Missing values also carry information. I want to be able to replace a missing value with a meaningful value, which should be derived from the target 'y'. If I fill NaN with some constant like -9999, it may be merged into another bin. In my case that reduces AUC by 0.5%.

> pypmml always returns a missing value for the NaN

This seems more reasonable to me. A possibly safe way to handle missing values when using sklearn2pmml is:

ContinuousDomain(invalid_value_treatment='as_missing', missing_value_replacement=None)

instead of

ContinuousDomain(invalid_value_treatment='as_missing', missing_value_replacement=np.nan)

With this setting, the missingValueReplacement parameter is not generated in the PMML file. After that, jpmml, pypmml, and sklearn2pmml all give the same output. See: https://github.com/DiegoSong/CodeBox/blob/master/jpmml_pypmml.csv
Sorry for the 000000|999999.
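
The difference between the two settings shows up in the generated markup. The sketch below is not sklearn2pmml's code; it only mimics the behaviour described in this thread, where a replacement of None yields no missingValueReplacement attribute while np.nan yields missingValueReplacement="NaN". The helper name mining_field is hypothetical.

```python
import math
import xml.etree.ElementTree as ET

def mining_field(name, invalid_value_treatment=None, missing_value_replacement=None):
    """Build a MiningField element, emitting only the attributes that are set."""
    field = ET.Element("MiningField", {"name": name})
    if invalid_value_treatment is not None:
        field.set("invalidValueTreatment", invalid_value_treatment)
    if missing_value_replacement is not None:
        is_nan = (isinstance(missing_value_replacement, float)
                  and math.isnan(missing_value_replacement))
        field.set("missingValueReplacement",
                  "NaN" if is_nan else str(missing_value_replacement))
    return ET.tostring(field, encoding="unicode")

print(mining_field("oProb", "asMissing", None))
print(mining_field("oProb", "asMissing", float("nan")))
```

Without the attribute, a missing input stays missing and reaches the downstream imputer unchanged, which is where the evaluators agree.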

vruusmann commented 4 years ago

> The missingValueReplacement parameter is not generated in the PMML file.

@DiegoSong The expression ContinuousDomain(missing_value_replacement=None) is a default no-op instruction, hence there's no PMML markup generated. However, you can keep it in your Python code for extra clarity.

DiegoSong commented 4 years ago

> The expression ContinuousDomain(missing_value_replacement=None) is a default no-op instruction, hence there's no PMML markup generated.

It may be helpful when predicting on a CSV file, to avoid this error:

Exception in thread "main" org.jpmml.evaluator.InvalidResultException (at or around line 26 of the PMML document): Field "oProb" cannot accept user input value ""
at org.jpmml.evaluator.InputFieldUtil.performInvalidValueTreatment(InputFieldUtil.java:235)
at org.jpmml.evaluator.InputFieldUtil.prepareScalarInputValue(InputFieldUtil.java:151)
at org.jpmml.evaluator.InputFieldUtil.prepareInputValue(InputFieldUtil.java:111)
at org.jpmml.evaluator.InputField.prepare(InputField.java:70)
at org.jpmml.evaluator.example.EvaluationExample.execute(EvaluationExample.java:413)
at org.jpmml.evaluator.example.Example.execute(Example.java:92)
at org.jpmml.evaluator.example.EvaluationExample.main(EvaluationExample.java:262)
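
When scoring from Python instead, a similar problem with empty cells can be sidestepped by converting them to None before evaluation. A minimal sketch, assuming the downstream evaluator treats None as a missing value (an assumption, not something confirmed in this thread); the sample data and the read_records helper are hypothetical.

```python
import csv
import io

# Hypothetical input with one empty oProb cell.
SAMPLE = """cProb,oProb
0.06,0.5
0.10,
"""

def read_records(text):
    """Yield one dict per CSV row, mapping empty cells to None and the rest to float."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            name: (float(cell) if cell and cell.strip() else None)
            for name, cell in row.items()
        }

records = list(read_records(SAMPLE))
print(records)
```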