dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

distributed `hist` tree method not working? #4127

Closed rongou closed 5 years ago

rongou commented 5 years ago

I have a test that takes some rows from the mortgage dataset (https://rapidsai.github.io/demos/datasets/mortgage-data) and trains xgboost on Spark. It previously used the auto/approx tree method and seemed to work fine. I just built from master and tried the new hist method, but it seems to be giving me random predictions (AUC of 0.5). Has anyone tried the distributed hist method on a real dataset, and does it work for you?

Here is the prediction from auto/approx:

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|    0|(27,[10,11,12,13,...|[0.00272440910339...|[1.00272440910339...|       0.0|
|    0|(27,[10,11,12,13,...|[6.00278377532959...|[1.00060027837753...|       0.0|
|    0|(27,[10,11,12,13,...|[3.54290008544921...|[1.00035429000854...|       0.0|
|    0|(27,[10,11,12,13,...|[9.58383083343505...|[1.00095838308334...|       0.0|
|    0|(27,[10,11,12,13,...|[-4.1690468788146...|[0.99958309531211...|       0.0|
|    0|(27,[10,11,12,13,...|[0.00151866674423...|[1.00151866674423...|       0.0|
|    0|(27,[10,11,12,13,...|[0.00215685367584...|[1.00215685367584...|       0.0|
|    0|(27,[10,11,12,13,...|[-0.0041129589080...|[0.99588704109191...|       0.0|
|    0|(27,[10,11,12,13,...|[-7.7697634696960...|[0.99922302365303...|       0.0|
|    0|(27,[10,11,12,13,...|[1.54614448547363...|[1.00015461444854...|       0.0|
|    0|(27,[10,11,12,13,...|[-0.0034057796001...|[0.99659422039985...|       0.0|
|    0|(27,[10,11,12,13,...|[-0.0017806887626...|[0.99821931123733...|       0.0|
|    0|(27,[10,11,12,13,...|[-0.0025569796562...|[0.99744302034378...|       0.0|
|    0|(27,[10,11,12,13,...|[-5.8969855308532...|[0.99941030144691...|       0.0|
|    0|(27,[10,11,12,13,...|[-3.9306282997131...|[0.99960693717002...|       0.0|
|    0|(27,[10,11,12,13,...|[0.00174278020858...|[1.00174278020858...|       0.0|
|    0|(27,[10,11,12,13,...|[-0.0033082067966...|[0.99669179320335...|       0.0|
|    1|(27,[10,11,12,13,...|[-0.9979411363601...|[0.00205886363983...|       1.0|
|    0|(27,[10,11,12,13,...|[0.00129508972167...|[1.00129508972167...|       0.0|
|    0|(27,[10,11,12,13,...|[-0.0039285719394...|[0.99607142806053...|       0.0|
+-----+--------------------+--------------------+--------------------+----------+
only showing top 20 rows

(0.9779702151130722,117887121413ns,2450346809ns)

Note the AUC (0.9779702151130722) is close to 1.
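For readers less familiar with the metric, here is a minimal sketch of AUC computed by pairwise comparison: the fraction of (positive, negative) pairs in which the positive row gets the higher score, with ties counted as half. The function name `pairwise_auc` and the toy labels and scores are illustrative assumptions, not values from this test. It shows why a model that scores every row identically ends up at exactly 0.5.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): two positives, three negatives.
labels = [0, 0, 1, 0, 1]

# Every positive outranks every negative.
ranked = pairwise_auc(labels, [0.1, 0.2, 0.9, 0.3, 0.8])

# Every row gets the same score, so every pair is a tie.
constant = pairwise_auc(labels, [0.5] * 5)

print(ranked, constant)  # 1.0 0.5
```

The quadratic loop is only meant for small illustrations; a rank-based implementation would be the usual choice for real datasets.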

Here is the exact same test with s/auto/hist/:

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|    0|(27,[10,11,12,13,...|[-5.9604644775390...|[0.99999994039535...|       0.0|
|    0|(27,[10,11,12,13,...|[-5.9604644775390...|[0.99999994039535...|       0.0|
|    0|(27,[10,11,12,13,...|[-5.9604644775390...|[0.99999994039535...|       0.0|
|    0|(27,[10,11,12,13,...|[-5.9604644775390...|[0.99999994039535...|       0.0|
|    0|(27,[10,11,12,13,...|[-5.9604644775390...|[0.99999994039535...|       0.0|
|    0|(27,[10,11,12,13,...|[-5.9604644775390...|[0.99999994039535...|       0.0|
|    0|(27,[10,11,12,13,...|[-5.9604644775390...|[0.99999994039535...|       0.0|
|    1|(27,[10,11,12,13,...|[-5.9604644775390...|[0.99999994039535...|       0.0|
|    1|(27,[10,11,12,13,...|[-5.9604644775390...|[0.99999994039535...|       0.0|
|    0|(27,[10,11,12,13,...|[-5.9604644775390...|[0.99999994039535...|       0.0|
|    0|(27,[10,11,12,13,...|[-5.9604644775390...|[0.99999994039535...|       0.0|
|    0|(27,[10,11,12,13,...|[-5.9604644775390...|[0.99999994039535...|       0.0|
|    0|(27,[10,11,12,13,...|[-5.9604644775390...|[0.99999994039535...|       0.0|
|    0|(27,[10,11,12,13,...|[-5.9604644775390...|[0.99999994039535...|       0.0|
|    0|(27,[10,11,12,13,...|[-5.9604644775390...|[0.99999994039535...|       0.0|
|    1|(27,[10,11,12,13,...|[-5.9604644775390...|[0.99999994039535...|       0.0|
|    0|(27,[10,11,12,13,...|[-5.9604644775390...|[0.99999994039535...|       0.0|
|    0|(27,[10,11,12,13,...|[-5.9604644775390...|[0.99999994039535...|       0.0|
|    0|(27,[10,11,12,13,...|[-5.9604644775390...|[0.99999994039535...|       0.0|
|    0|(27,[10,11,12,13,...|[-5.9604644775390...|[0.99999994039535...|       0.0|
+-----+--------------------+--------------------+--------------------+----------+
only showing top 20 rows

(0.5,55247534229ns,2068507194ns)

CodingCat commented 5 years ago

We have multiple test cases guarding accuracy, and I also tested with internal datasets; I didn't see the issue.

But if you use external memory, that would mess up the accuracy.

CodingCat commented 5 years ago

Can you check your driver log and see whether the training error goes in the right direction?

CodingCat commented 5 years ago

Validated again, didn't find the issue.

rongou commented 5 years ago

Thanks for checking. Yeah it might very well be something silly on our end. Let me dig a bit deeper to see if I can find the root cause.

Just for my education, is hist pretty much a drop-in replacement for approx? Is there anything I'd need to change in order to make one work versus the other?

CodingCat commented 5 years ago

It should be as easy as that.

If you set use_external_memory to true with approx, change it to false, as external memory in hist has some problems right now.
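As a rough sketch of the switch described above, the change amounts to two entries in the parameter map. The keys `tree_method` and `use_external_memory` come from this thread; the helper name `switch_to_hist`, the `objective` entry, and the use of a plain Python dict (rather than the Scala map the Spark package takes) are assumptions made for illustration.

```python
def switch_to_hist(params):
    """Return a copy of `params` that uses hist without external memory."""
    new_params = dict(params)  # leave the caller's dict untouched
    new_params["tree_method"] = "hist"
    # External memory with hist is reported as problematic in this thread.
    new_params["use_external_memory"] = False
    return new_params


# Hypothetical starting configuration that worked with approx.
approx_params = {
    "objective": "binary:logistic",
    "tree_method": "approx",
    "use_external_memory": True,
}

hist_params = switch_to_hist(approx_params)
print(hist_params["tree_method"], hist_params["use_external_memory"])
```

Every other parameter is carried over unchanged, which is what "drop-in" means here.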

rongou commented 5 years ago

Still trying to debug this; I don't see anything obviously wrong on my end. I hard-coded verbosity to 3 in the C++ code, and I'm seeing lots of these lines:

[15:16:31] INFO: /home/rou/src/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 0 extra nodes, 0 pruned nodes, max_depth=0

That doesn't seem quite right, does it?

RAMitchell commented 5 years ago

This indicates no tree is being grown - there is only a root with a single weight. Does this occur from the first tree or later in the boosting process?
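One way to answer that question is to scan the log for the pruning messages and note which ones report zero extra nodes. The sketch below is hedged: the regular expression is inferred from the single log line quoted above, the function name `root_only_trees` is made up, and it assumes one pruning message per tree in log order, which may not hold when several workers write to the same log.

```python
import re

# Pattern inferred from the one log line quoted in this thread.
PRUNE_RE = re.compile(
    r"tree pruning end, (\d+) roots, (\d+) extra nodes, "
    r"(\d+) pruned nodes, max_depth=(\d+)"
)


def root_only_trees(log_lines):
    """Return positions (0-based, in log order) of trees with no splits."""
    flagged = []
    tree_index = 0
    for line in log_lines:
        match = PRUNE_RE.search(line)
        if match is None:
            continue  # not a pruning message
        extra_nodes = int(match.group(2))
        if extra_nodes == 0:
            flagged.append(tree_index)
        tree_index += 1
    return flagged


# The log line from the comment above.
sample = [
    "[15:16:31] INFO: /home/rou/src/xgboost/src/tree/updater_prune.cc:74: "
    "tree pruning end, 1 roots, 0 extra nodes, 0 pruned nodes, max_depth=0"
]

print(root_only_trees(sample))  # [0]
```

If the first flagged position is 0, the problem is present from the very first tree rather than appearing later in boosting.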

CodingCat commented 5 years ago

This actually reminds me of something.

@rongou can you print the value of gmat.cut.row_ptr before the following line?

https://github.com/dmlc/xgboost/blob/master/src/tree/updater_quantile_hist.cc#L632

rongou commented 5 years ago

Their values are always 0, 1.

CodingCat commented 5 years ago

ok, would you please enumerate all values in row_ptr?

rongou commented 5 years ago

gmat.cut.row_ptr=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30, 155, 160, 208, 260, 263, 309, 426, 429, 570, 574, 627, 880, 913, 1040, 1125, 1131]
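To make these numbers easier to read, the sketch below converts the offsets into a per-feature count. It assumes `row_ptr` is a CSR-style offset array in which feature `i` owns the cut values in the range `row_ptr[i]` to `row_ptr[i + 1]`; that reading is an assumption about the data structure, not something stated in this thread. The variable names are illustrative.

```python
# Values copied from the comment above.
row_ptr = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30, 155, 160, 208, 260, 263,
           309, 426, 429, 570, 574, 627, 880, 913, 1040, 1125, 1131]

# Difference between consecutive offsets = number of cut values per feature.
bins_per_feature = [hi - lo for lo, hi in zip(row_ptr, row_ptr[1:])]

# Features that ended up with a single cut value.
single_bin_features = [i for i, n in enumerate(bins_per_feature) if n == 1]

print(len(bins_per_feature))   # 27
print(single_bin_features)     # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(max(bins_per_feature))   # 253
```

Under that assumption there are 27 features, matching the size of the feature vectors in the tables above, and features 0 through 9 each have a single cut value, which is consistent with the displayed sparse rows listing indices from 10 onward. The remaining features have between 3 and 253 cut values, so the cuts themselves do not look empty.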

CodingCat commented 5 years ago

Looks like the way node stats are computed in distributed hist is too approximate.

I will come up with an approach similar to the one used in approx.

CodingCat commented 5 years ago

you can apply this patch https://github.com/CodingCat/xgboost/pull/10 to resolve the issue

I will file a PR to master branch once https://github.com/dmlc/xgboost/pull/4102 is merged

rongou commented 5 years ago

Patch worked, thanks!