Is this problem confined to tree_method=hist? Did you try exact or approx?
I tried using the approx method, and it works fine then, although the results are worse and training takes more time. As mentioned in the code above (+ blue line on the plot):
model3 = xgb.train(
    params={"eta": ETA_BASE, "tree_method": "approx", "booster": "gbtree", "silent": 0},
    [...]
    learning_rates=ETA_DECAY
)
I did not try the exact method.
A memory leak seems probable. Let me look into it after the 0.80 release.
I've had this issue before. I don't know exactly what is happening, but I found a workaround.
While studying it, I found that the learning_rates parameter in xgb.train actually installs a reset_learning_rate callback. Then I tried other custom callbacks and saw this memory leak as well. It looks as if using any callback other than the print callback causes the tree to re-initialize at every iteration.
My workaround was to add a "learning_rate_schedule" dmlc parameter and then set the new learning rate at the beginning of each iteration. It involved quite a bit of modification of the C++ code. I saw this problem with gpu_hist as well, so I edited the CUDA code too. In the end my solution resets the learning rate without callbacks, and it works.
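For reference, here is a minimal sketch of the equivalence mentioned above (learning_rates being shorthand for the reset_learning_rate callback), assuming the 0.7x/0.8x-era Python API; the data file and decay schedule are placeholders:

    import xgboost as xgb

    dtrain = xgb.DMatrix("train.libsvm")  # placeholder data
    eta_schedule = [0.3 * 0.95 ** i for i in range(100)]

    # Passing learning_rates=... to xgb.train ...
    bst_a = xgb.train({"tree_method": "hist"}, dtrain, num_boost_round=100,
                      learning_rates=eta_schedule)

    # ... is shorthand for installing the reset_learning_rate callback, which
    # updates the learning_rate parameter on the booster every round.
    bst_b = xgb.train({"tree_method": "hist"}, dtrain, num_boost_round=100,
                      callbacks=[xgb.callback.reset_learning_rate(eta_schedule)])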
@hcho3 0.80 has been released; did you get a chance to look at this leak?
@Denisevi4 can you share the code for that?
@Denisevi4 For the CUDA gpu_hist, did you find unusual memory usage in GPU memory, or just CPU memory? I'm currently spending time on gpu_hist; I'll see if I can dig something out.
@dev7mt @Denisevi4 @kretes @trivialfis I think I found the cause of the memory leak. When learning rate decay is enabled, FastHistMaker::Init() is called every iteration, whereas it should be called only in the first iteration. The initialization function FastHistMaker::Init() allocates new objects, hence the rising memory usage over time. I'll try to come up with a fix so that FastHistMaker::Init() is called only once.
Here is a snippet of diagnostic logs I injected.
Learning rate decay enabled:
xgboost/src/tree/updater_prune.cc:75: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[0] test-rmse:7.7284
xgboost/src/c_api/c_api.cc:869: XGBoosterSetParam(): name = learning_rate, value = 0.2777777777777778
Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
xgboost/src/tree/updater_fast_hist.cc:50: FastHistMaker::Init()
xgboost/src/tree/updater_prune.cc:24: TreePruner()
xgboost/src/tree/updater_fast_hist.cc:72: FastHistMaker::Update(): is_gmat_initialized_ = false
xgboost/src/common/hist_util.cc:127: GHistIndexMatrix::Init()
xgboost/src/tree/../common/column_matrix.h:72: ColumnMatrix::Init()
xgboost/src/tree/updater_prune.cc:75: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[1] test-rmse:6.82093
xgboost/src/c_api/c_api.cc:869: XGBoosterSetParam(): name = learning_rate, value = 0.25555555555555554
Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
xgboost/src/tree/updater_fast_hist.cc:50: FastHistMaker::Init()
xgboost/src/tree/updater_fast_hist.cc:72: FastHistMaker::Update(): is_gmat_initialized_ = false
xgboost/src/common/hist_util.cc:127: GHistIndexMatrix::Init()
xgboost/src/tree/../common/column_matrix.h:72: ColumnMatrix::Init()
xgboost/src/tree/updater_prune.cc:75: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
Learning rate decay disabled:
xgboost/src/tree/updater_prune.cc:75: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[0] test-rmse:7.72278
xgboost/src/tree/updater_fast_hist.cc:72: FastHistMaker::Update(): is_gmat_initialized_ = true
xgboost/src/tree/updater_prune.cc:75: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[1] test-rmse:6.75087
Diagnosis: The learning rate decay callback calls XGBoosterSetParam() to update the learning rate. XGBoosterSetParam() in turn calls Learner::Configure(), which re-initializes every tree updater (calling FastHistMaker::Init()). The FastHistMaker updater maintains extra objects that are meant to be recycled across iterations, and re-initialization wastes memory by duplicating those internal objects.
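The same path can be exercised from Python without any callback, since every Booster.set_param() call goes through XGBoosterSetParam(). A minimal sketch, with placeholder data (ru_maxrss is the process high-water RSS, reported in kilobytes on Linux):

    import resource
    import xgboost as xgb

    dtrain = xgb.DMatrix("train.libsvm")  # placeholder data
    bst = xgb.Booster({"tree_method": "hist", "max_depth": 6}, [dtrain])

    for i in range(200):
        # Each set_param() call reaches XGBoosterSetParam() and hence
        # Learner::Configure(), re-initializing the 'hist' updater.
        bst.set_param("learning_rate", 0.3 * 0.95 ** i)
        bst.update(dtrain, i)
        print(i, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)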
@dev7mt @Denisevi4 @kretes @trivialfis Fix is available at #3803.
@dev7mt @Denisevi4 @kretes The upcoming release (version 0.81) will not include a fix for this memory leak. The reason is that the fix is only temporary, adds a lot of maintenance burden, and will be supplanted by a future code refactor. For now, you should use approx or exact when using learning rate decay. Alternatively, check out the eta_decay_memleak branch from my fork.
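A minimal sketch of the suggested workaround, keeping the schedule but switching the tree method (the data file and schedule below are placeholders):

    import xgboost as xgb

    dtrain = xgb.DMatrix("train.libsvm")  # placeholder data
    ETA_DECAY = [0.3 * 0.95 ** i for i in range(100)]

    # Same schedule, but tree_method="approx" (or "exact") sidesteps the
    # leaking 'hist' updater until a proper fix lands.
    bst = xgb.train({"eta": 0.3, "tree_method": "approx"}, dtrain,
                    num_boost_round=len(ETA_DECAY), learning_rates=ETA_DECAY)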
@hcho3
The FastHistMaker updater maintains extra objects that are meant to be recycled across iterations, and re-initialization wastes memory by duplicating those internal objects.
Could you be more specific about which objects? I'm currently working on the parameter update and may just fix this along the way...
Hey,
I have been trying to implement an eta decay scheme in my project that is quite specific to my needs, but I kept running into OutOfMemory errors. After a bit of digging, I found that setting the learning rate while using the "hist" tree_method causes the same issue, which led me to believe that the callback itself is not the problem here.
I have tested this on multiple environments (two different Ubuntu setups, on-premise and cloud, as well as macOS), and it always produced similar errors.
The code below should reproduce the issue:
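A minimal sketch along those lines (synthetic placeholder data; ETA_BASE and ETA_DECAY are as in the approx snippet quoted elsewhere in the thread; watching the process RSS while this runs shows the growth with "hist"):

    import numpy as np
    import xgboost as xgb

    # Synthetic placeholder data standing in for the original dataset.
    X = np.random.rand(100_000, 50)
    y = np.random.rand(100_000)
    dtrain = xgb.DMatrix(X, label=y)

    ETA_BASE = 0.3
    ETA_DECAY = [ETA_BASE * 0.95 ** i for i in range(300)]

    # With tree_method="hist" plus a learning-rate schedule, memory usage
    # climbs every boosting round; with "approx" it stays flat.
    model1 = xgb.train(
        params={"eta": ETA_BASE, "tree_method": "hist", "booster": "gbtree"},
        dtrain=dtrain,
        num_boost_round=len(ETA_DECAY),
        learning_rates=ETA_DECAY,
    )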
Attached is a plot from my run of the code above.
I did not dig into the underlying C++ code, but a memory leak seems plausible. As I understand it, this is not the desired behaviour, but maybe this method simply requires that much memory.