dask / dask-xgboost

BSD 3-Clause "New" or "Revised" License

xgboost does not train from existing model in distributed environment #70

Open trieuat opened 4 years ago

trieuat commented 4 years ago

When continuing training xgboost from an existing model in a distributed environment with more than 3 workers, xgboost does not train: nothing happens on the workers and the job never finishes. But on a local cluster, or on a distributed cluster with fewer than 3 workers, training runs and finishes.

dxgb.train(client, params, X_train, y_train, xgb_model=existing_model,...)
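For context, a minimal sketch of this kind of call (the scheduler address, data loading, and parameter values below are placeholders, not the actual setup from this issue):

```python
# Minimal sketch of continued training with dask-xgboost; the scheduler
# address, parquet path, column names, and parameters are placeholders.
import dask.dataframe as dd
import dask_xgboost as dxgb
from dask.distributed import Client

client = Client("tcp://scheduler:8786")    # cluster with more than 3 workers

df = dd.read_parquet("train.parquet")      # hypothetical training data
X_train = df.drop(columns=["label"])
y_train = df["label"]

params = {"objective": "binary:logistic", "tree_method": "hist"}

# First pass produces a Booster to continue from.
existing_model = dxgb.train(client, params, X_train, y_train, num_boost_round=50)

# Continued training: with xgb_model set and more than 3 workers,
# this is the call that reportedly hangs.
booster = dxgb.train(client, params, X_train, y_train,
                     num_boost_round=50, xgb_model=existing_model)
```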

TomAugspurger commented 4 years ago

Do you know why that might be?

Do things work if you use the native dask integration? https://xgboost.readthedocs.io/en/latest/tutorials/dask.html
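For comparison, a rough sketch of what continued training looks like with the native integration (data loading and parameters are again placeholders; xgb_model should be forwarded to the underlying xgboost training call):

```python
# Rough sketch of the native xgboost.dask API; paths, column names, and
# parameter values are illustrative only.
import dask.dataframe as dd
import xgboost as xgb
from dask.distributed import Client

client = Client("tcp://scheduler:8786")

df = dd.read_parquet("train.parquet")
dtrain = xgb.dask.DaskDMatrix(client, df.drop(columns=["label"]), df["label"])

params = {"objective": "binary:logistic", "tree_method": "hist"}

# Initial training; the result is a dict holding the Booster and history.
output = xgb.dask.train(client, params, dtrain, num_boost_round=50)

# Continue training from the existing Booster via xgb_model.
output = xgb.dask.train(client, params, dtrain, num_boost_round=50,
                        xgb_model=output["booster"])
```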

trieuat commented 4 years ago

I don't know why. The native dask integration in the link can train from an existing model.

However, I have a different problem with it: its predictive performance is essentially random in the distributed environment, versus good performance from dask-xgboost with the same parameters and data.

TomAugspurger commented 4 years ago

I'm not sure why that would be, but the usual recommendation is to create a performance report: https://distributed.dask.org/en/latest/diagnosing-performance.html#performance-reports
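Something along these lines (the filename is arbitrary, and the wrapped call reuses the names from the snippet above):

```python
# Sketch: wrap the training call in a performance report so the task
# stream and worker profiles can be inspected afterwards. The filename is
# arbitrary and client/params/X_train/y_train are the objects from above.
from dask.distributed import performance_report

with performance_report(filename="dask-xgb-report.html"):
    booster = dxgb.train(client, params, X_train, y_train,
                         xgb_model=existing_model)
# Open dask-xgb-report.html in a browser to review the run.
```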


trieuat commented 4 years ago

I can create a performance report, but the issue is that training does not seem to happen and never finishes, even if I build just one tree. CPU usage on all workers is 2-6% (vs. ~100%+ if I remove the xgb_model parameter). If you have any suggestions for how to debug it, please let me know.

jakirkham commented 4 years ago

Just a blind guess: have you tried deleting the dask-worker-space and storage directories that Dask creates?

They will be wherever temporary-directory is set to. This is most likely set in ~/.config/dask/dask.yaml, but it could be configured elsewhere depending on what your code is doing. If unspecified, these directories will be in the same directory you ran the script or notebook from.
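If it helps, the effective setting can be checked (and overridden) from Python via dask.config; the path below is just an example:

```python
# Check which temporary directory Dask will use for worker space; None
# means it falls back to the current working directory. The override
# path below is only an example.
import dask

print(dask.config.get("temporary-directory", default=None))

dask.config.set({"temporary-directory": "/tmp/dask-scratch"})
```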

trieuat commented 4 years ago

Thanks for the suggestion. I had a dask-worker-space directory in my folder; after removing it, I can train with several workers on a small dataset. I then moved to submitting the job with skein instead of using my edge node as the client. But when I increase the dataset size (still < 1 GB), it does not train again. Looking at the log, I can see that only one worker started the hist algorithm but did not progress to building any tree, and nothing happened on the other workers.