combust / mleap

MLeap: Deploy ML Pipelines to Production
https://combust.github.io/mleap-docs/
Apache License 2.0
1.5k stars 310 forks source link

xgboost predicted probabilities match upto only 7 significant digits #604

Closed talalryz closed 4 years ago

talalryz commented 4 years ago

For my xgboost model, predicted probabilities for the non-serialized and serialized model match only upto 7 significant digits. I also noticed that in the xgboost tests for mleap, we're only checking that the predicted differ by < 1e-7 here: https://github.com/combust/mleap/blob/7b40eb7bf00802a0f3dc564bca61e7736bcd0404/mleap-xgboost-spark/src/test/scala/ml/dmlc/xgboost4j/scala/spark/mleap/XGBoostClassificationModelParitySpec.scala#L103

Do we understand what is causing this discrepancy?

talalryz commented 4 years ago

This is happening because scala floats are 32 bit IEEE-754, so anything after 7 significant digits is made up!! https://www.scala-lang.org/api/2.12.3/scala/Float.html

lucagiovagnoli commented 4 years ago

@talalryz it's not an option to use doubles ? isn't this a Double tensor ? https://github.com/combust/mleap/blob/7b40eb7bf00802a0f3dc564bca61e7736bcd0404/mleap-xgboost-spark/src/test/scala/ml/dmlc/xgboost4j/scala/spark/mleap/XGBoostClassificationModelParitySpec.scala#L99

talalryz commented 4 years ago

@lucagiovagnoli The problem lies in dmlc xgboost itself. The booster returns a scala Float, which then gets cast into a Double by mleap. So, anything after the first 7 significant digits is garbage.

The call to the dmlc xgboost booster is done here: https://github.com/combust/mleap/blob/7b40eb7bf00802a0f3dc564bca61e7736bcd0404/mleap-xgboost-runtime/src/main/scala/ml/combust/mleap/xgboost/runtime/XGBoostClassificationModel.scala#L39