[DISCUSSION] Adopting JSON-like format as next-generation model format

dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow

https://xgboost.readthedocs.io/en/stable/

Apache License 2.0


[DISCUSSION] Adopting JSON-like format as next-generation model format #3916

Closed. trivialfis closed this issue 5 years ago

trivialfis commented 5 years ago

As discussed in #3878 and #3886, we might want a more extensible format for saving XGBoost models.

For now, my plan is to use the JSONReader and JSONWriter implemented in dmlc-core to add experimental support for saving and loading models as JSON files. Because the related area of code is quite messy and dangerous to change, I want to share my plan, and possibly an early PR, as soon as possible so that someone can point out my mistakes early (there will be mistakes) and we don't duplicate work. :)

@hcho3

hcho3 commented 5 years ago

@KOLANICH The named keys are essential to allow for future addition of parameters. For me, this is by and large the most important motivation for using JSON for saving models. Compactness can be achieved by using a binary encoding of JSON (to be added later).
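To make the named-keys point concrete, here is a minimal Python sketch. It is not XGBoost's actual model format: the parameter names, the defaults, and the `save_params` / `load_params` helpers are all hypothetical. It only shows why a keyed format tolerates parameters that are added later.

```python
import json

# Hypothetical parameter names and defaults, for illustration only.
# `new_param` stands in for a parameter added in a later release.
DEFAULTS = {"max_depth": 6, "eta": 0.3, "new_param": 1.0}


def save_params(params):
    """Serialize parameters as a JSON object with named keys."""
    return json.dumps(params, sort_keys=True)


def load_params(text):
    """Load parameters, filling in defaults for keys missing from the file."""
    loaded = json.loads(text)
    params = dict(DEFAULTS)
    params.update(loaded)
    return params


# A file written before `new_param` existed still loads cleanly,
# because each value is looked up by name rather than by position.
old_file = save_params({"max_depth": 4, "eta": 0.1})
params = load_params(old_file)
```

With a positional binary layout, adding a field would shift every value after it; with named keys, the loader simply falls back to the default for anything the file does not mention.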

@tqchen I'll look into using the DMLC JSON parser.

trivialfis commented 5 years ago

@hcho3 My current priority is #3952, but I will get back to JSON as soon as possible. I tried using the JSON support from dmlc before; it can be used for saving a KVStore you created, but it needs some more handling code. Most of the hurdles lie in using JSONReader.

trivialfis commented 5 years ago

And binary json also needs to be handled.

tqchen commented 5 years ago

Minimal dependencies are indeed important, and I want to emphasize that: nothing is free, and we need to make sure the project can be easily ported to various platforms. Most JSON libraries assume a schema-less model, which adds overhead. In our case, we do have a schema for the model, so dmlc's JSON parser is sufficient to do the job.
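As a rough illustration of what "having a schema" means here, the following Python sketch declares the expected fields up front and reads only those. The schema, the field names, and `read_with_schema` are hypothetical and are not dmlc's API. Python's `json.loads` still builds a generic object first, so this shows the shape of schema-driven reading, not the overhead savings.

```python
import json

# Hypothetical schema: field name -> (expected type, required?)
TREE_SCHEMA = {
    "num_nodes": (int, True),
    "split_indices": (list, True),
    "leaf_values": (list, True),
    "note": (str, False),
}


def read_with_schema(text, schema):
    """Read a JSON object, accepting only the fields declared in `schema`."""
    obj = json.loads(text)
    unknown = set(obj) - set(schema)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    result = {}
    for name, (expected_type, required) in schema.items():
        if name not in obj:
            if required:
                raise ValueError(f"missing required field: {name}")
            continue  # optional field that is absent: skip it
        value = obj[name]
        if not isinstance(value, expected_type):
            raise TypeError(f"field {name!r} should be {expected_type.__name__}")
        result[name] = value
    return result


doc = '{"num_nodes": 3, "split_indices": [0], "leaf_values": [0.1, -0.2]}'
tree = read_with_schema(doc, TREE_SCHEMA)
```

Because the reader knows every field in advance, it can report a missing or unexpected key immediately instead of carrying a generic document tree through the rest of the loading code.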

tqchen commented 5 years ago

Given that the format is something that will affect usage quite a lot, maybe we should open an RFC on the JSON format proposal? We need to document it properly anyway. @trivialfis @hcho3 Thanks for taking the lead on this.

hcho3 commented 5 years ago

Sure, I'll draft an RFC in the next few days. I think I have a good idea of how to achieve the desired goals without introducing a third-party dependency.

I will close this issue once the RFC is up.

hcho3 commented 5 years ago

Closing this now. The RFC document is available at #3980.