Closed DiegoSong closed 4 years ago
@DiegoSong Thanks for your info. Could you attach your model here? then we can reproduce the issue easily.
@scorebot Thanks for your reply. I sent the pmml to this address: autodeploy.ai@gmail.com.
@DiegoSong Thanks, I have got the pmml model. I tested it in my environment with the latest pypmml 0.9.6 installed:
>>> from pypmml import Model
>>> model = Model.load('model.pmml')
>>> model.predict({'cProb': 0.06, 'pProb': 0.3, 'oProb': 0.5, 'T3Prob': 0.1, 'cEmrgProb': 0.06, 'pEmrgProb': 0.3, 'oEmrgProb': 0.5})
{'probability(1)': 0.18478507571419323, 'probability(0)': 0.8152149242858068}
The input record contains all valid values for those input fields. I did not find anything wrong. What about the results from jpmml and pipeline.predict_proba?
https://github.com/DiegoSong/CodeBox/blob/master/jpmml_pypmml.csv The data shows the diff between jpmml and pypmml. They are equal when the features are not NaN.
I debugged both pmml4s (used by pypmml) and jpmml; the differences are caused by different missing value handling policies. Take the first line of jpmml_pypmml.csv as an example:
Values of both fields oProb and oEmrgProb are missing. Per the PMML, missingValueReplacement="NaN" is applied, so both take the value NaN. Then we need to compute the values of both derived fields cut(oProb) and cut(oEmrgProb): there is no defaultValue defined, so cut(oProb)=missing and cut(oEmrgProb)=missing. When evaluating imputer(cut(oProb)) and imputer(cut(oEmrgProb)), jpmml and pypmml return different values.
That's the reason why the final results are different. I think pypmml is more reasonable: the double value NaN (not a number) should be treated as a missing value because no interval contains such a value.
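The divergence can be sketched in plain Python. The interval bounds, bin labels, default value, and imputer replacement below are made-up illustrations, not the values from the actual model; the point is only how a defaultValue versus a "return missing" policy changes what the downstream imputer sees.

```python
import math

# Hypothetical bins for a cut()-style Discretize: leftMargin < x <= rightMargin
INTERVALS = [((0.0, 0.25), "low"), ((0.25, 1.0), "high")]

def discretize(x, default=None):
    """Return the bin label for x, or `default` when no interval matches (e.g. NaN)."""
    if x is None or math.isnan(x):
        return default  # NaN falls in no interval
    for (lo, hi), label in INTERVALS:
        if lo < x <= hi:
            return label
    return default

def impute(binned, replacement="low"):
    """Replace a missing bin with a constant, as a PMML imputer would."""
    return replacement if binned is None else binned

# Policy A (pypmml-like): NaN -> missing -> handled by the imputer downstream
print(impute(discretize(float("nan"))))                   # "low"
# Policy B: a defaultValue is returned, so the imputer never fires
print(impute(discretize(float("nan"), default="high")))   # "high"
```

With no defaultValue in the actual model, the derived fields come out missing, and what each engine's imputer then does is where the two results split.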
BTW, what are the results of the native Python model? If the results are the same as jpmml, the generated pmml model could have a problem, and you would need to file a bug for JPMML-SkLearn.
@DiegoSong Don't ever specify Domain.missing_value_replacement = NaN. Use some meaningful constant.
I think pypmml is more reasonable, the double value NaN (not a number) should be treated as a missing value because there is no interval contains such value.
NaN is an invalid value, not a missing value.
See http://dmg.org/pmml/v4-4/Transformations.html
notANumber. This is the value returned from such meaningless expressions as the logarithm of a negative number. It should not be used as a generic missing value, as missing values are properly those which are unknown, indeterminate, or non-applicable. All arithmetic expressions involving NaN will evaluate to NaN.
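The DMG semantics quoted above can be checked directly in plain Python, where float NaN behaves the same way: it propagates through arithmetic and falls outside every interval test, rather than acting like a generic missing value.

```python
import math

nan = float("nan")

# All arithmetic involving NaN evaluates to NaN
assert math.isnan(nan + 1.0)
assert math.isnan(nan * 0.0)

# NaN is not equal even to itself, so interval membership tests always fail
assert nan != nan
assert not (0.0 < nan <= 1.0)
```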
We may agree that a Discretize element should return NaN when inputted with NaN, but returning a Discretize@defaultValue is definitely not correct.
@vruusmann Thanks for your comments. Yes, NaN (not a number) is an invalid value, not a missing value. Suppose Discretize returns NaN when the input is NaN: what's the final result? NaN, or a normal result based on the imputers?
Suppose Discretize returns NaN when the input is NaN, what's the final result?
It seems to me that the NaN should be propagated between expression and model elements till the end, resulting in a NaN prediction. For example, when one numeric predictor is NaN (whether supplied directly by the end user, or computed during data pre-processing), then RegressionModel should also predict NaN.
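A minimal sketch of the propagation rule being proposed here (not pypmml's or jpmml's actual implementation): if any predictor entering the linear part of a RegressionModel is NaN, the score itself comes out NaN, with no silent imputation along the way.

```python
import math

def regression_score(intercept, coefs, xs):
    """Hypothetical linear predictor: NaN in any input propagates into the score."""
    return intercept + sum(c * x for c, x in zip(coefs, xs))

# All inputs present -> a normal score (0.5 + 1.0*0.3 - 2.0*0.1 = 0.6)
assert math.isclose(regression_score(0.5, [1.0, -2.0], [0.3, 0.1]), 0.6)
# One NaN predictor -> NaN prediction
assert math.isnan(regression_score(0.5, [1.0, -2.0], [float("nan"), 0.1]))
```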
This issue manifests itself with NaN right now, but it should apply to all invalid values. For example, consider the following field definition:
<DataField name="x" dataType="double">
<Value value="0" property="invalid"/>
</DataField>
When the end user supplies x = 0, then it is classified as an invalid value, and should trigger the same chain of handlers as x = NaN would.
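A sketch of that handler chain, assuming the DataField above where the value "0" is declared invalid. The function name and treatment flags are illustrative, not a real pypmml or jpmml API:

```python
import math

# Values declared invalid for field "x" via <Value value="0" property="invalid"/>
INVALID_VALUES = {0.0}

def prepare_input(x, invalid_value_treatment="asMissing", missing_replacement=None):
    """Route declared-invalid inputs (and NaN) onto the same missing-value path."""
    if x is None or math.isnan(x) or x in INVALID_VALUES:
        if invalid_value_treatment == "asMissing":
            return missing_replacement  # falls through to missing handling
        raise ValueError(f"invalid input value {x!r}")
    return x

assert prepare_input(0.0) is None           # x = 0 -> treated as missing
assert prepare_input(float("nan")) is None  # x = NaN -> same chain of handlers
assert prepare_input(0.3) == 0.3            # valid value passes through
```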
The handling of invalid values is almost completely unspecified by DMG.org. They should clarify.
Thank you for your reply! Missing values also carry information. I meant to replace missing values with a meaningful value, which should come from the target 'y'. If I fillna with some constant like -9999, it may merge into another bin. In my case it causes a 0.5% AUC reduction.
pypmml always returns a missing value for NaN. This is more reasonable to me.
Here may be a safe way to handle missing values when using sklearn2pmml:
ContinuousDomain(invalid_value_treatment='as_missing', missing_value_replacement=None)
instead of
ContinuousDomain(invalid_value_treatment='as_missing', missing_value_replacement=np.nan)
The missingValueReplacement attribute will then not be generated in the pmml file.
After that, jpmml, pypmml and sklearn2pmml give the same output.
See: https://github.com/DiegoSong/CodeBox/blob/master/jpmml_pypmml.csv
sorry for the 000000|999999
Parameter missingValueReplacement is not generated in the pmml file.
@DiegoSong The expression ContinuousDomain(missing_value_replacement=None) is a default no-op instruction, hence there's no PMML markup generated. However, you can keep it in your Python code for extra clarity.
The expression ContinuousDomain(missing_value_replacement=None) is a default no-op instruction, hence there's no PMML markup generated.
It may be helpful when predicting on a CSV file, to avoid this error:
Exception in thread "main" org.jpmml.evaluator.InvalidResultException (at or around line 26 of the PMML document): Field "oProb" cannot accept user input value ""
at org.jpmml.evaluator.InputFieldUtil.performInvalidValueTreatment(InputFieldUtil.java:235)
at org.jpmml.evaluator.InputFieldUtil.prepareScalarInputValue(InputFieldUtil.java:151)
at org.jpmml.evaluator.InputFieldUtil.prepareInputValue(InputFieldUtil.java:111)
at org.jpmml.evaluator.InputField.prepare(InputField.java:70)
at org.jpmml.evaluator.example.EvaluationExample.execute(EvaluationExample.java:413)
at org.jpmml.evaluator.example.Example.execute(Example.java:92)
at org.jpmml.evaluator.example.EvaluationExample.main(EvaluationExample.java:262)
I made a pmml using PMMLPipeline = DataFrameMapper + LogisticRegression. The DataFrameMapper is like:
The pypmml output seems incorrect compared with jpmml and pipeline.predict_proba.