autodeployai / pmml4s

PMML scoring library for Scala
https://www.pmml4s.org/
Apache License 2.0

Ensemble Neural Networks #5

Closed jperuggia closed 4 years ago

jperuggia commented 4 years ago

I attempted to create a model from the example provided by the DMG: here

I am using Java to instantiate the model with a snippet such as this:

Model m = Model.fromFile(tempFile);

where tempFile is the downloaded copy of the PMML file above.

Doing so fails to create the model and throws an exception, which I can reproduce.

I was wondering whether this is due to ensembles of neural networks not being supported, or to some other underlying issue?

scorebot commented 4 years ago

@jperuggia Thanks for reporting this. It is a defect in the PMML4S library. The issue has already been fixed in the latest code, but the fix is not included in the latest official release, 0.9.3; it will be in the next version, 0.9.4. I will let you know once it is released, which should be soon.

scorebot commented 4 years ago

@jperuggia Version 0.9.4 has been released. Please try the model again; it should be OK.

jperuggia commented 4 years ago

Awesome, thanks for the update. I will verify that it is working early next week and report back. I will provide a new stack trace if it still errors.

vruusmann commented 4 years ago

This is an invalid model file by KNIME - the target field is a categorical integer, but the ensemble is predicting a continuous double.

It's an error to silently map a double to an integer. It must be rounded explicitly (via ceil, floor, round or something else). There is no such rounding directive in KNIME files - hence they are invalid.
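To illustrate why the choice of rounding matters, here is a minimal Java sketch. The class `CastInteger` and its `Mode` enum are hypothetical names invented for this example; they are not part of PMML4S or JPMML. The three modes are meant to mirror the values `round`, `ceiling` and `floor` that, as far as I know, the PMML `Target` element accepts in its `castInteger` attribute.

```java
// Hypothetical helper for illustration only; not an API of PMML4S or JPMML.
final class CastInteger {

    // Assumed to correspond to the castInteger values round, ceiling, floor.
    enum Mode { ROUND, CEILING, FLOOR }

    // Converts a raw floating-point prediction to an integer using an
    // explicitly chosen rounding mode, never an implicit cast.
    static long apply(double value, Mode mode) {
        switch (mode) {
            case ROUND:
                return Math.round(value);        // nearest integer, ties round up
            case CEILING:
                return (long) Math.ceil(value);  // smallest integer >= value
            case FLOOR:
                return (long) Math.floor(value); // largest integer <= value
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }

    public static void main(String[] args) {
        double raw = 0.576; // the example value used later in this thread
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " -> " + apply(raw, mode));
        }
    }
}
```

For the raw value 0.576, ROUND and CEILING give 1 while FLOOR gives 0, so the predicted class depends entirely on which directive the model producer intended.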

DMG.org PMML examples section is full of outdated and invalid files - stay away from them for your own sanity!

scorebot commented 4 years ago

@vruusmann Thanks for your comments. I agree with you that most of the DMG examples are outdated and invalid. Regarding the model ensemble_audit_mlp.xml from KNIME above, the target field TARGET_Adjusted is declared as continuous:

<DataField dataType="integer" name="TARGET_Adjusted" optype="continuous">
  <Interval closure="closedClosed" leftMargin="0.0" rightMargin="1.0"/>
</DataField>

I know the data type is integer, so during training the value of TARGET_Adjusted is either 0 or 1. In the prediction phase, however, the predicted value can be a floating-point number. If no rounding method is specified in the PMML, returning the raw floating-point number is reasonable, and the client is free to round the predicted value.
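The combination under discussion (dataType="integer" with optype="continuous") can be detected before scoring. The following is a rough sketch using only the JDK's DOM parser; `DataFieldCheck` is a hypothetical class written for this example and does not reflect how PMML4S or JPMML inspect the data dictionary.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical checker for illustration only.
final class DataFieldCheck {

    // Returns true when a single <DataField> element declares an integer
    // data type together with a continuous operational type.
    static boolean isIntegerContinuous(String dataFieldXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(dataFieldXml)));
            Element field = doc.getDocumentElement();
            return "integer".equals(field.getAttribute("dataType"))
                    && "continuous".equals(field.getAttribute("optype"));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a parseable DataField element", e);
        }
    }

    public static void main(String[] args) {
        // The DataField element quoted in this thread.
        String xml =
            "<DataField dataType=\"integer\" name=\"TARGET_Adjusted\" optype=\"continuous\">"
          + "<Interval closure=\"closedClosed\" leftMargin=\"0.0\" rightMargin=\"1.0\"/>"
          + "</DataField>";
        System.out.println(isIntegerContinuous(xml));
    }
}
```

A client that sees this flag can decide up front whether to round the prediction itself or to reject the model.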

vruusmann commented 4 years ago

The JPMML family of libraries does not make any assumptions or "helper guesses" - life has shown that it simply perpetuates the problem. In the current case, this issue should be reported to KNIME so that they would fix their broken PMML producer (this example PMML file is 5+ years old, but I'm sure that the latest KNIME version is still producing such invalid markup), and DMG.org should very critically revise what is listed in their examples section.

What I mean by "perpetuating the problem" - if the model indicates that the type of a target field is integer (two possible values - 0 and 1), but you are returning a float or double value sometimes (eg. 0.576), then you are simply shifting this decision - "how should a floating-point value be rounded to an integer" downstream, to somebody else. This somebody is even less informed to make a correct decision. In my opinion, the only correct solution is to throw an exception "expected integer, but got a non-integer" in the first location (ie. inside the PMML engine) where the problem is detected.
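The fail-fast behaviour described above could look roughly like this in Java. It is only a sketch of the idea: `StrictIntegerTarget` and `requireInteger` are hypothetical names, and this is not the actual JPMML implementation.

```java
// Hypothetical strict check for illustration only; not JPMML code.
final class StrictIntegerTarget {

    // Accepts a prediction only if it is already a whole number;
    // otherwise fails at the point where the mismatch is detected.
    static long requireInteger(double predicted) {
        boolean whole = !Double.isNaN(predicted)
                && !Double.isInfinite(predicted)
                && predicted == Math.rint(predicted);
        if (!whole) {
            throw new IllegalStateException(
                    "expected integer, but got a non-integer: " + predicted);
        }
        return (long) predicted;
    }

    public static void main(String[] args) {
        System.out.println(requireInteger(1.0)); // a whole number passes
        try {
            requireInteger(0.576);               // a fractional value is rejected
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```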

KNIME must:

scorebot commented 4 years ago

Yes, KNIME needs to correct the invalid PMML. Because PMML is a loose standard, a model can conform to the PMML schema completely and still have semantic problems, especially models produced by commercial products. This particular problem is common: we have seen several examples of it from different products. We would like every existing PMML model to be correct, but in the real world not all of them are.

There is nothing wrong with a PMML scoring engine throwing an exception for such a model. Our scoring library will issue a warning instead of an error, to let the client know the risk of scoring such a PMML model. In any case, we appreciate your comments.
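For contrast with the strict approach, the warn-and-continue policy could be sketched as follows. `LenientIntegerTarget` and its `Result` type are hypothetical and are not the PMML4S API; the point is only that the raw value is returned unchanged while the caller is told about the mismatch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical lenient check for illustration only; not PMML4S code.
final class LenientIntegerTarget {

    // Carries the untouched prediction plus any warnings raised while checking it.
    static final class Result {
        final double value;
        final List<String> warnings;

        Result(double value, List<String> warnings) {
            this.value = value;
            this.warnings = Collections.unmodifiableList(warnings);
        }
    }

    // Never throws and never rounds: a non-integer prediction for an
    // integer-typed field produces a warning instead of an error.
    static Result check(String fieldName, double predicted) {
        List<String> warnings = new ArrayList<>();
        if (predicted != Math.rint(predicted)) {
            warnings.add("Field " + fieldName + " is declared as integer, "
                    + "but the predicted value " + predicted + " is not an integer");
        }
        return new Result(predicted, warnings);
    }

    public static void main(String[] args) {
        Result r = check("TARGET_Adjusted", 0.576);
        System.out.println(r.value);
        r.warnings.forEach(System.out::println);
    }
}
```

The trade-off is the one debated in this thread: the model still scores, but the rounding decision moves to the client.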

jperuggia commented 4 years ago

@scorebot I have verified that the fix allows the model to load and execute. I agree with @vruusmann that no assumptions should be made about how to score a model that KNIME exported as invalid PMML; it shouldn't be up to the execution engine to determine how to handle data inconsistencies.

I am going to close out this task as it addresses the problem at hand with this library. The invalid PMML produced by KNIME is being tracked on their end in a different issue.

I would suggest adding a note to the library's page explaining that PMML4S makes some assumptions about typecasting here. This might cause issues for consumers who aren't expecting such a model to run, and documenting it would bring expectations more in line with the JPMML evaluators.