Open · trieuat opened this issue 4 years ago
Do you know why that might be?
Do things work if you use the native dask integration? https://xgboost.readthedocs.io/en/latest/tutorials/dask.html
I don't know why. The native Dask integration from the link can train from an existing model.
However, I have a different problem with it: in a distributed environment its predictive performance is essentially random, whereas dask-xgboost gives good performance with the same parameters and data.
I'm not sure why that would be, but the usual recommendation is to create a performance report: https://distributed.dask.org/en/latest/diagnosing-performance.html#performance-reports
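For reference, a performance report can be captured with the `performance_report` context manager from `dask.distributed`; the filename and the work done inside the block below are just placeholders:

```python
from dask.distributed import Client, performance_report

client = Client()  # connect to your cluster; defaults to a local one

# Everything executed inside this block is profiled and written to an
# interactive, self-contained HTML file you can open in a browser and share.
with performance_report(filename="dask-xgboost-report.html"):
    # ... run the training call here, e.g.
    # dxgb.train(client, params, X_train, y_train, ...)
    pass

client.close()
```

The resulting HTML file bundles the task stream, worker profiles, and bandwidth plots, which is usually enough to see whether workers are idle, communicating, or stuck.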
I can create a performance report, but the thing is that the training does not seem to happen and never finishes, even if I build just one tree. CPU usage on all workers is 2-6% (vs. ~100%+ CPU usage if I remove the `xgb_model` parameter). If you have any suggestions for how to debug it, please let me know.
Just a blind guess, but have you tried deleting the `dask-worker-space` and `storage` directories that Dask creates? They will be wherever `temporary-directory` is set to. This would most likely be set in `~/.config/dask/dask.yaml`, but it could be configured in other places depending on what your code might be doing. If unspecified, these directories will be in the same directory you ran the script or notebook from.
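A quick way to see where Dask will put that scratch space is to read the resolved config value directly:

```python
import dask

# Where worker scratch space (dask-worker-space, etc.) is created.
# If this is None, workers fall back to the current working directory,
# which is where a stale dask-worker-space can end up next to your script.
print(dask.config.get("temporary-directory"))
```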
Thanks for the suggestion. I had a `dask-worker-space` directory in my folder; after removing it, I can train with several workers on a small dataset. So I moved to submitting the job with skein instead of using my edge node as the client. But when I increase the dataset size (still < 1 GB), it does not train again. Looking at the logs, I can see that only one worker started the `hist` algorithm but did not progress to building any tree, and nothing happened on the other workers.
When continuing to train xgboost from an existing model in a distributed environment with more than 3 workers, xgboost does not train: nothing happens on the workers and the call never finishes. But on a local cluster, or a distributed cluster with fewer than 3 workers, training runs and finishes.
```python
dxgb.train(client, params, X_train, y_train, xgb_model=existing_model, ...)
```