dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Eliminate text parsing from feature importances and evaluation metrics #6091

Open hcho3 opened 4 years ago

hcho3 commented 4 years ago

Currently, important functions such as feature importances and evaluation metrics rely on parsing text strings, specifically the text output of the model dump function. For example:

https://github.com/dmlc/xgboost/blob/68c55a37d9bb680fe435f1d011e5fea62be97d22/python-package/xgboost/core.py#L1797-L1832

https://github.com/dmlc/xgboost/blob/68c55a37d9bb680fe435f1d011e5fea62be97d22/jvm-packages/xgboost4j/src/main/java/ml/dmlc/xgboost4j/java/Booster.java#L509-L540

https://github.com/dmlc/xgboost/blob/68c55a37d9bb680fe435f1d011e5fea62be97d22/python-package/xgboost/training.py#L85-L91

https://github.com/dmlc/xgboost/blob/68c55a37d9bb680fe435f1d011e5fea62be97d22/jvm-packages/xgboost4j/src/main/java/ml/dmlc/xgboost4j/java/Booster.java#L240-L255

Also see https://github.com/dmlc/xgboost/issues/4665#issuecomment-532932603 and https://github.com/dmlc/xgboost/issues/4665#issuecomment-532945623

We should aim to eliminate all such uses of text parsing, since a slight change in the text dump will cause all these functions to break.
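To make the fragility concrete, here is a minimal sketch of the kind of string slicing involved. It is not the code linked above: the function name `split_counts_from_text`, the sample dump lines, and the "changed" format with parentheses are all hypothetical, chosen only to show how a small formatting change can silently produce wrong results rather than an error.

```python
from collections import Counter


def split_counts_from_text(dump_lines):
    """Count splits per feature by slicing text-dump lines.

    Hypothetical helper for illustration; it assumes every split line
    looks like "<id>:[<feature><<threshold>] yes=...,no=...".
    """
    counts = Counter()
    for line in dump_lines:
        parts = line.split("[")
        if len(parts) == 1:
            # No "[" found, so the line is treated as a leaf and skipped.
            continue
        feature = parts[1].split("]")[0].split("<")[0]
        counts[feature] += 1
    return counts


# Hypothetical lines in the style of the current text dump.
current_format = ["0:[age<30] yes=1,no=2,missing=1", "1:leaf=0.1", "2:leaf=-0.2"]

# The same tree if the dump (hypothetically) switched to parentheses.
changed_format = ["0:(age<30) yes=1,no=2,missing=1", "1:leaf=0.1", "2:leaf=-0.2"]

before = split_counts_from_text(current_format)
after = split_counts_from_text(changed_format)
```

With the assumed current format the split on `age` is counted once. With the assumed changed format the split line is mistaken for a leaf, so the result is empty and no exception is raised.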

Proposed replacement:

Now that XGBoost has a functioning JSON library as well as a numeric printing function (charconv), this should be doable.
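As a rough illustration of what a structured alternative could look like on the consumer side, the sketch below counts splits per feature by walking a JSON tree instead of slicing text. The node layout (`split`, `leaf`, `children`) is an assumption modeled on the JSON dump format, and `split_counts_from_json` is a hypothetical helper, not an existing XGBoost function or the design proposed here.

```python
import json
from collections import Counter


def split_counts_from_json(tree_json_strings):
    """Count splits per feature from JSON-encoded trees.

    Assumed node layout: internal nodes carry "split" and "children",
    leaf nodes carry "leaf". Hypothetical helper for illustration.
    """
    counts = Counter()
    for tree_str in tree_json_strings:
        stack = [json.loads(tree_str)]
        while stack:
            node = stack.pop()
            if "leaf" in node:
                continue  # leaves contribute no split
            counts[node["split"]] += 1
            stack.extend(node.get("children", []))
    return counts


# Hypothetical tree; the feature name contains characters that would
# confuse bracket-based text slicing.
tree = json.dumps({
    "nodeid": 0, "split": "ratio[a/b]", "split_condition": 0.5,
    "yes": 1, "no": 2, "missing": 1,
    "children": [
        {"nodeid": 1, "leaf": 0.1},
        {"nodeid": 2, "split": "age", "split_condition": 30,
         "yes": 3, "no": 4, "missing": 3,
         "children": [{"nodeid": 3, "leaf": -0.2}, {"nodeid": 4, "leaf": 0.3}]},
    ],
})

counts = split_counts_from_json([tree])
```

Because the feature name is read from a dedicated field, its contents and the surrounding formatting no longer affect the result.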

trivialfis commented 3 years ago

I looked into this a little bit. The implementation isn't difficult, but it depends on https://github.com/dmlc/xgboost/pull/6605 due to the use of feature names/types. I will try to figure out a better way to store that information.