dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

[Roadmap] Phasing out the support for old binary format. #7547

Open trivialfis opened 2 years ago

trivialfis commented 2 years ago

XGBoost has a custom binary model format that has been used since day 1. Later, in 1.0, we introduced the JSON format as an alternative, which has a schema and better extensibility. The JSON format has been used as the default format for memory snapshot serialization (pickle, rds, etc.) and has extra features, including categorical data support and extra data such as feature names and feature types. However, for performance and compatibility reasons, we have continued to support the old binary format. In 1.6, we plan to add Universal Binary JSON as an extension to the current JSON format and as a replacement for the old binary format.

Motivation

The old binary format essentially copies internal structures like parameters and tree nodes into a memory buffer, so it has a fixed memory layout that's difficult to change and debug. The Learner class is full of conditions that work around issues the binary format has accumulated over time. These issues stem from the fact that we cannot change the binary output in any way, which also has an indirect impact on how we write code. For instance, we cannot change the RegTree structure, the very core of XGBoost, because of how the node is stored in the output. To overcome these issues and clear some room for future development, we need to phase out its use.
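To make the fixed-layout problem concrete, here is a minimal sketch in Python. The node layout below is hypothetical and is not XGBoost's actual on-disk structure; it only illustrates how a reader built for a changed layout can no longer parse a buffer that was dumped with the old one.

```python
import struct

# Hypothetical fixed layout for one tree node (NOT XGBoost's real layout):
# parent, left child, right child (int32), split index (uint32), split value (float32).
NODE_V1 = struct.Struct("<iiiIf")

# The same node after adding a single one-byte field at the end.
NODE_V2 = struct.Struct("<iiiIfB")


def dump_node_v1(parent, left, right, split_index, split_value):
    """Copy the fields straight into a byte buffer, as a raw memory dump would."""
    return NODE_V1.pack(parent, left, right, split_index, split_value)


def load_node_v2(buffer):
    """A reader built against the new layout; it expects 21 bytes, not 20."""
    return NODE_V2.unpack(buffer)


old_buffer = dump_node_v1(-1, 1, 2, 3, 0.5)

try:
    load_node_v2(old_buffer)
except struct.error as err:
    # The old dump is one byte short, so the new reader rejects it.
    print("cannot read old model:", err)
```

A self-describing format such as JSON avoids this because fields are looked up by name, so a new field can be added without shifting the ones that already exist.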

Roadmap

If the Universal Binary JSON implementation is accepted, I propose the following roadmap for phasing out the support of the old binary format:

note

trivialfis commented 2 years ago

@hcho3

hcho3 commented 2 years ago

This is necessary since the default_left is changed from boolean to integer.

How necessary is this? Was the default_left changed to improve performance?

trivialfis commented 2 years ago

Yes. Most of the improvement comes from the typed array, where we can omit the construction of the Json struct and the guesswork about the next element. But there's no typed array for boolean.

Actually, there is, but it's not very useful. A boolean is represented by the characters T and F, which are both type and value at the same time. So if we were to have a typed boolean array, the whole array would be either all true or all false.
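A rough sketch of the difference, assuming the optimized container layout from UBJSON draft 12 as I understand it (`[`, then `$` plus a type marker, then `#` plus a count, then the raw payload). The helper names are made up for illustration and this is not the encoder XGBoost uses.

```python
import struct


def encode_int8_typed_array(values):
    """Encode small integers as a strongly typed UBJSON array.

    Layout: b"[$i#i" + count (one int8) + one raw int8 per element.
    The element type is stated once, so the payload carries no per-element markers.
    """
    if len(values) > 127:
        raise ValueError("this sketch only handles counts that fit in an int8")
    header = b"[$i#i" + struct.pack(">b", len(values))
    payload = struct.pack(f">{len(values)}b", *values)
    return header + payload


def encode_bool_array(values):
    """Encode booleans as a plain UBJSON array.

    Each element is the marker T or F, which is both its type and its value,
    so mixed booleans cannot share a single type marker.
    """
    return b"[" + b"".join(b"T" if v else b"F" for v in values) + b"]"


print(encode_int8_typed_array([1, 0, 1]))       # b'[$i#i\x03\x01\x00\x01'
print(encode_bool_array([True, False, True]))   # b'[TFT]'
```

Storing default_left as 0/1 integers lets it use the first form, where the reader knows the element type up front.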

We can continue to support the current JSON model for a very long time since the additional code is not much (one condition to check whether the value is bool or int), but I think it's also quite easy to move away from it, since users can simply replace true with 1 and false with 0 in the JSON file. I can create a script for doing just that.
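A minimal sketch of what such a script could look like, using only the standard library. It is not an official tool: rather than assume a fixed path to the trees, it walks the whole document and rewrites every list stored under a `default_left` key. The function names and file paths are placeholders.

```python
import json


def convert_default_left(node):
    """Recursively replace booleans with 0/1 in every "default_left" list, in place."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "default_left" and isinstance(value, list):
                # int(True) == 1 and int(False) == 0; existing ints pass through unchanged.
                node[key] = [int(v) for v in value]
            else:
                convert_default_left(value)
    elif isinstance(node, list):
        for item in node:
            convert_default_left(item)
    return node


def convert_file(src_path, dst_path):
    """Read a JSON model, convert it, and write the result to a new file."""
    with open(src_path, "r") as fd:
        model = json.load(fd)
    convert_default_left(model)
    with open(dst_path, "w") as fd:
        json.dump(model, fd)
```

For example, `convert_file("old_model.json", "new_model.json")` would leave the original file untouched. Loading the old model in a current XGBoost and saving it again, as the warning message suggests, should achieve the same result without a script.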

mpetricek-corp commented 1 year ago

Is there a simple way to silence this warning "Found JSON model saved before XGBoost 1.6, please save the model using current version again. The support for old JSON model will be discontinued in XGBoost 2.3." when using the Java interface? That is, the ml.dmlc.xgboost4j.java.XGBoost class from the ml.dmlc.xgboost-jvm_2.12 Maven artifact.

The C code seems to accept a "verbosity" configuration, but so far I have not found a way to set it from the Java code.