Some issues of models generated by the example notebooks

scorebot commented 5 years ago

Hi guys, thanks for the awesome project. I have some questions about the models produced by the example notebooks. I used PyPMML to test those models.

1. lgbmr_pmml_preprocess.pmml, exported by nyoka/examples/lgbm/3_lgbm_With_PreProcess .ipynb. Open the notebook and add the following two cells at the end:
```
pipeline_obj.predict(x_test)
```
This makes predictions on x_test with the built pipeline model; the predicted value for the first record is 23.51497829.

```
from pypmml import Model
model = Model.fromFile("lgbmr_pmml_preprocess.pmml")
model.predict(x_test)
```

This loads the model with PyPMML and makes the predictions again; the first predicted value is now 22.055896. The two values differ, although they are expected to be identical.
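A minimal sketch of how the two sets of predictions could be compared record by record. The helper name `find_mismatches` and the tolerance are hypothetical choices for illustration. The sketch works on plain sequences of floats, so the outputs of `pipeline_obj.predict` and `model.predict` would first need to be converted to lists.

```python
import math


def find_mismatches(expected, actual, rel_tol=1e-6):
    """Return (index, expected, actual) for every pair that differs beyond rel_tol.

    Hypothetical helper: `expected` would hold the scikit-learn pipeline output,
    `actual` the values scored from the exported PMML.
    """
    return [
        (i, e, a)
        for i, (e, a) in enumerate(zip(expected, actual))
        if not math.isclose(e, a, rel_tol=rel_tol)
    ]


# First-record values reported in this issue.
mismatches = find_mismatches([23.51497829], [22.055896])
print(mismatches)
```

An empty list would mean the exported PMML reproduces the pipeline within the chosen tolerance.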

I then tried to debug the case and found two potential problems in the exported PMML. I need to confirm with you whether they are real problems. Please correct me if I have something wrong.

The attribute `wordSeparatorCharacterRE` of `TextIndex`: `(?u)\b\w\w+\b` is used for all of them. As described by the DMG, the `wordSeparatorCharacterRE` attribute can be used to pass a regular expression containing possible word separator characters. When `(?u)\b\w\w+\b` is applied to the related derived fields, all values are evaluated as 0.0. For example, the first test record is `{"car name": "ford pinto", "displacement": 122.0}`; to evaluate the derived field `count_vec@[car name](ford)`, the constant term "ford" is split into "", and the input value "ford pinto" is split into two empty strings. Please check whether the value `(?u)\b\w\w+\b` is suitable here.
Then I modified the PMML, changing every `(?u)\b\w\w+\b` to the default value `\s`. Now I think the values of the derived fields are fine, but the final result is still the original value 22.055896. I checked the ensemble trees. Taking the field `count_vec@[car name](ford)` as an example again, all tree nodes use it like this:
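The splitting behaviour described above can be reproduced with a short sketch. It uses Python's `re` module purely for illustration; a PMML scoring engine may use a different regex implementation and may trim or drop pieces differently. The pattern `(?u)\b\w\w+\b` is the default `token_pattern` of scikit-learn's `CountVectorizer`, that is, it matches the words themselves rather than the separators between them.

```python
import re

TOKEN_PATTERN = r"(?u)\b\w\w+\b"  # matches whole words of two or more characters
DEFAULT_SEPARATOR = r"\s"         # PMML default for wordSeparatorCharacterRE

text = "ford pinto"

# Used as a separator, the pattern consumes the words and leaves only
# what lies between them.
as_separator = re.split(TOKEN_PATTERN, text)

# Used as a token matcher, the same pattern returns the words.
as_tokens = re.findall(TOKEN_PATTERN, text)

# Splitting on whitespace gives the expected terms.
by_whitespace = re.split(DEFAULT_SEPARATOR, text)

print(as_separator, as_tokens, by_whitespace)
```

With Python's `re`, no piece produced by the first split can ever equal the term "ford", which is consistent with every derived count being evaluated as 0.0.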
```
<Node>
<SimplePredicate field="count_vec@[car name](ford)" operator="lessThan" value="0.0000000000000000"/>
<Node>
...
</Node>
</Node>
<Node score="-0.13932418765500187">
<SimplePredicate field="count_vec@[car name](ford)" operator="greaterOrEqual" value="0.0000000000000000"/>
</Node>
```
The first node will never be used and the second node is always hit, so the field `car name` is effectively useless: when I change `car name` to any string, the evaluated value is still 22.055896. Could you check whether this is intended?

2. dtr_pmml.pmml, exported by nyoka/examples/skl/5_Decision_Tree_With_Tf-Idf.ipynb, has the same issue: the attribute `wordSeparatorCharacterRE` is set to `(?u)\b\w\w+\b`.
3. rf_pmml.pmml, exported by nyoka/examples/skl/3_RF_With_pre-processing.ipynb, has an output field `predicted_Species`:
```
<OutputField name="predicted_Species" optype="categorical" dataType="string" feature="predictedValue"/>
```
Its data type is string, but I think it should be integer, to match the integer target `Species`.
4. OneClassSVM_model.pmml, exported by nyoka/examples/skl/OneClassSVM_model.pmml, is an AnomalyDetectionModel with version 4.4. Will it be a standard model of PMML 4.4?
5. Both 2classMBNet.pmml and sequentialModel.pmml, the Keras models, use the new model type DeepNetwork with version 4.4. Will it be part of PMML 4.4?

nyoka-pmml commented 5 years ago

Hi @scorebot, thanks for your feedback. Here are the answers to your questions:

3. This will be taken care of in the next releases. Thanks for pointing this out.
4. Yes, AnomalyDetectionModel will be a standard model of 4.4.
5. No, DeepNetwork will not be part of 4.4. It will be part of 5.0.

For questions 1 and 2, I will go through them and get back to you. Thanks!

Nirmal-Neel commented 5 years ago

Hi @scorebot, 1) The reason for the misprediction is the operator in the SimplePredicate. Instead of lessThan and greaterOrEqual, it should be lessOrEqual and greaterThan. 2) Yes, wordSeparatorCharacterRE="(?u)\b\w\w+\b" is not correct. The default value should be \s as you mentioned.

Thanks for pointing out these issues. These will be resolved in the next release of Nyoka.
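The effect of the operator choice can be illustrated with a small sketch. The function `branch_taken` and the operator table are hypothetical names used only for this illustration, and the sketch assumes, as in the exported trees quoted above, a split threshold of 0.0 on a count feature that can never be negative.

```python
import operator

# PMML SimplePredicate operator names mapped to Python comparisons.
OPERATORS = {
    "lessThan": operator.lt,
    "lessOrEqual": operator.le,
    "greaterThan": operator.gt,
    "greaterOrEqual": operator.ge,
}


def branch_taken(value, threshold, left_op, right_op):
    """Return which child node matches, checking the left predicate first."""
    if OPERATORS[left_op](value, threshold):
        return "left"
    if OPERATORS[right_op](value, threshold):
        return "right"
    return None


results = {}
for count in (0.0, 1.0):
    exported = branch_taken(count, 0.0, "lessThan", "greaterOrEqual")
    corrected = branch_taken(count, 0.0, "lessOrEqual", "greaterThan")
    results[count] = (exported, corrected)
    print(count, exported, corrected)
```

With `lessThan`/`greaterOrEqual`, a count of 0 and a count of 1 both fall into the right branch, so the feature cannot influence the score. With `lessOrEqual`/`greaterThan`, the two counts are separated.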

scorebot commented 5 years ago

@nyoka-pmml @Nirmal-Neel Thanks for your answers. I'm glad these problems can be fixed in the next release.

I can see there is already a draft page about AnomalyDetectionModel on the SourceForge PMML Project, but I can't find any page about DeepNetwork. Is there any documentation for this PMML standard candidate?

nyoka-pmml commented 5 years ago

@scorebot, yes, you will not find any draft page for DeepNetwork since it is not part of the latest schema. If you need it, I can provide some information that is currently used internally.

scorebot commented 5 years ago

@nyoka-pmml I would appreciate it if you could provide some information about DeepNetwork. I plan to implement both new models in my open source PMML scoring libraries.

nyoka-pmml commented 5 years ago

For the DeepNetwork schema, please refer to pmml44.xsd.

DeepNetwork is the root element for a DeepNet model. Its child element is NetworkLayer, which contains information of each layer of the model.

The layerId attribute of NetworkLayer is unique for each layer, and connectionLayerId (some layer's layerId) creates a link to the connected layer. Each NetworkLayer element has three child elements: LayerParameters, LayerWeights and LayerBias. LayerParameters has attributes for each layer (for the list of attributes, you can refer to the schema). LayerWeights and LayerBias hold the layer's weight and bias information, which is represented as a base64 string. The format is - data:float32;base64,tbQ0P1JAQj+f4hI/Yt7OPmwCpD60eR8/MfycPpEy8D4=. The base64 string should be encoded in __LITTLE_ENDIAN__ order and is prefixed with data:(float32|float64);base64,.

(In LayerParameters, the batch size is not included in inputDimension and outputDimension.)
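A minimal sketch of how such a weight or bias string could be decoded, following the format described above (a `data:(float32|float64);base64,` prefix followed by little-endian values). The function name `decode_layer_data` is hypothetical, and the sketch returns a flat list; reshaping the values to the layer's dimensions is left out.

```python
import base64
import struct

# struct format character and byte width for each supported prefix type.
DTYPES = {"float32": ("f", 4), "float64": ("d", 8)}


def decode_layer_data(encoded):
    """Decode 'data:<dtype>;base64,<payload>' into a flat list of floats."""
    header, payload = encoded.split(",", 1)
    dtype = header[len("data:"):].split(";")[0]
    fmt_char, width = DTYPES[dtype]
    raw = base64.b64decode(payload)
    count = len(raw) // width
    # "<" selects little-endian byte order, as the description requires.
    return list(struct.unpack("<%d%s" % (count, fmt_char), raw))


example = "data:float32;base64,tbQ0P1JAQj+f4hI/Yt7OPmwCpD60eR8/MfycPpEy8D4="
values = decode_layer_data(example)
print(len(values), values)
```

The example string from the comment above carries 32 bytes, which decode to eight float32 values.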

nyoka-pmml commented 5 years ago

Hi @scorebot,

1. The reason for the misprediction is the operator in the SimplePredicate. Instead of `lessThan` and `greaterOrEqual`, it should be `lessOrEqual` and `greaterThan`.

2. Yes, `wordSeparatorCharacterRE="(?u)\b\w\w+\b"` is not correct. The default value should be `\s` as you mentioned.

Thanks for pointing out these issues. These will be resolved in the next release of Nyoka.

These issues have been resolved and released in Nyoka 3.3.0.

SoftwareAG / nyoka

Some issues of models generated by the example notebooks #15