dmitryikh / leaves

pure Go implementation of prediction part for GBRT (Gradient Boosting Regression Trees) models from popular frameworks
MIT License

read XGBoost model from JSON #27

Closed dmitryikh closed 5 years ago

dmitryikh commented 5 years ago

The XGBoost Booster has a dump_model method:

Help on method dump_model in module xgboost.core:

dump_model(fout, fmap='', with_stats=False, dump_format='text') method of xgboost.core.Booster instance
    Dump model into a text or JSON file.

    Parameters
    ----------
    fout : string
        Output file name.
    fmap : string, optional
        Name of the file containing feature map names.
    with_stats : bool, optional
        Controls whether the split statistics are output.
    dump_format : string, optional
        Format of model dump file. Can be 'text' or 'json'.

dump_format='text' output:

booster[0]:
0:[f29<-9.53674316e-07] yes=1,no=2,missing=1,gain=4000.53101,cover=1628.25
        1:[f56<-9.53674316e-07] yes=3,no=4,missing=3,gain=1158.21204,cover=924.5
                3:[f60<-9.53674316e-07] yes=7,no=8,missing=7,gain=568.215576,cover=812
                        7:[f23<-9.53674316e-07] yes=13,no=14,missing=13,gain=142.803711,cover=772.5
                                13:[f24<-9.53674316e-07] yes=19,no=20,missing=19,gain=138.515137,cover=763
                                        19:leaf=0.199735105,cover=754
                                        20:leaf=-0.179999992,cover=9
                                14:leaf=-0.180952385,cover=9.5
                        8:leaf=-0.195061728,cover=39.5
                4:[f21<-9.53674316e-07] yes=9,no=10,missing=9,gain=114.297333,cover=112.5
                        9:leaf=0.177777782,cover=8
                        10:leaf=-0.198104262,cover=104.5
        2:[f109<-9.53674316e-07] yes=5,no=6,missing=5,gain=198.173828,cover=703.75
                5:[f67<-9.53674316e-07] yes=11,no=12,missing=11,gain=86.3969727,cover=690.5
                        11:[f8<-9.53674316e-07] yes=15,no=16,missing=15,gain=13.9060059,cover=679.75
                                15:leaf=-0.199117333,cover=678.75
                                16:leaf=0.100000001,cover=1
                        12:[f39<-9.53674316e-07] yes=17,no=18,missing=17,gain=28.7762947,cover=10.75
                                17:leaf=0.177142859,cover=7.75
                                18:leaf=-0.150000006,cover=3
                6:leaf=0.185964927,cover=13.25
booster[1]:
0:[f29<-9.53674316e-07] yes=1,no=2,missing=1,gain=3273.53296,cover=1612.29773
        1:[f56<-9.53674316e-07] yes=3,no=4,missing=3,gain=947.959351,cover=915.424927
                3:[f60<-9.53674316e-07] yes=7,no=8,missing=7,gain=466.366455,cover=804.00647
                        7:[f23<-9.53674316e-07] yes=13,no=14,missing=13,gain=118.102295,cover=764.879822
                                13:[f24<-9.53674316e-07] yes=17,no=18,missing=17,gain=114.618896,cover=755.457153
                                        17:leaf=0.181651443,cover=746.529663
                                        18:leaf=-0.165040284,cover=8.92749214
                                14:leaf=-0.165846676,cover=9.42265606
                        8:leaf=-0.177735806,cover=39.1266365
                4:[f21<-9.53674316e-07] yes=9,no=10,missing=9,gain=94.7408752,cover=111.418503
                        9:leaf=0.163156673,cover=7.93712139
                        10:leaf=-0.180286244,cover=103.481377
        2:[f109<-9.53674316e-07] yes=5,no=6,missing=5,gain=163.45752,cover=696.872803
                5:[f67<-9.53674316e-07] yes=11,no=12,missing=11,gain=70.9365234,cover=683.736755
                        11:leaf=-0.180530906,cover=673.064026
                        12:[f39<-9.53674316e-07] yes=15,no=16,missing=15,gain=24.4080353,cover=10.6727066
                                15:leaf=0.162618011,cover=7.68951893
                                16:leaf=-0.139356777,cover=2.98318791
                6:leaf=0.170082554,cover=13.1361017
...

dump_format='json' output:

[
  { "nodeid": 0, "depth": 0, "split": "f29", "split_condition": -9.53674316e-07, "yes": 1, "no": 2, "missing": 1, "gain": 4000.53101, "cover": 1628.25, "children": [
    { "nodeid": 1, "depth": 1, "split": "f56", "split_condition": -9.53674316e-07, "yes": 3, "no": 4, "missing": 3, "gain": 1158.21204, "cover": 924.5, "children": [
      { "nodeid": 3, "depth": 2, "split": "f60", "split_condition": -9.53674316e-07, "yes": 7, "no": 8, "missing": 7, "gain": 568.215576, "cover": 812, "children": [
        { "nodeid": 7, "depth": 3, "split": "f23", "split_condition": -9.53674316e-07, "yes": 13, "no": 14, "missing": 13, "gain": 142.803711, "cover": 772.5, "children": [
          { "nodeid": 13, "depth": 4, "split": "f24", "split_condition": -9.53674316e-07, "yes": 19, "no": 20, "missing": 19, "gain": 138.515137, "cover": 763, "children": [
            { "nodeid": 19, "leaf": 0.199735105, "cover": 754 },
            { "nodeid": 20, "leaf": -0.179999992, "cover": 9 }
          ]},
          { "nodeid": 14, "leaf": -0.180952385, "cover": 9.5 }
        ]},
        { "nodeid": 8, "leaf": -0.195061728, "cover": 39.5 }
      ]},
      { "nodeid": 4, "depth": 2, "split": "f21", "split_condition": -9.53674316e-07, "yes": 9, "no": 10, "missing": 9, "gain": 114.297333, "cover": 112.5, "children": [
        { "nodeid": 9, "leaf": 0.177777782, "cover": 8 },
        { "nodeid": 10, "leaf": -0.198104262, "cover": 104.5 }
      ]}
    ]},
    { "nodeid": 2, "depth": 1, "split": "f109", "split_condition": -9.53674316e-07, "yes": 5, "no": 6, "missing": 5, "gain": 198.173828, "cover": 703.75, "children": [
      { "nodeid": 5, "depth": 2, "split": "f67", "split_condition": -9.53674316e-07, "yes": 11, "no": 12, "missing": 11, "gain": 86.3969727, "cover": 690.5, "children": [
        { "nodeid": 11, "depth": 3, "split": "f8", "split_condition": -9.53674316e-07, "yes": 15, "no": 16, "missing": 15, "gain": 13.9060059, "cover": 679.75, "children": [
          { "nodeid": 15, "leaf": -0.199117333, "cover": 678.75 },
          { "nodeid": 16, "leaf": 0.100000001, "cover": 1 }
        ]},
        { "nodeid": 12, "depth": 3, "split": "f39", "split_condition": -9.53674316e-07, "yes": 17, "no": 18, "missing": 17, "gain": 28.7762947, "cover": 10.75, "children": [
          { "nodeid": 17, "leaf": 0.177142859, "cover": 7.75 },
          { "nodeid": 18, "leaf": -0.150000006, "cover": 3 }
        ]}
      ]},
      { "nodeid": 6, "leaf": 0.185964927, "cover": 13.25 }
    ]}
  ]},
...

Unfortunately, these outputs are not enough to restore the model: they lack the number of classes, the transformation function, DART's dropout weights, etc.

It seems the dump is a one-way transformation only.

Also, some information here: https://stackoverflow.com/questions/43691380/how-to-save-load-xgboost-model

dmitryikh commented 5 years ago

It seems that the XGBoost JSON dump format does not allow the model to be fully restored from it. See the previous post.