Open sjl070707 opened 7 years ago
I recreated the container with what should be the right versions of both the Python and system dependencies:
g++ gcc
pip install xgboost==0.6a2 dask-xgboost==0.1.0
python 3.5.4 via https://repo.continuum.io/miniconda/Miniconda3-4.2.12-Linux-x86_64.sh miniconda3
The process is stuck at the training stage (booster).
It looks like an exception occurs and dask-xgboost is trying to handle it, or it is just waiting a very long time to be synced.
Traceback (most recent call last):
File "/work/miniconda/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2881, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "
The process is stuck at the training stage (booster).
Do we know why it's stuck? Does XGBoost provide logs here? At this stage Dask has done all of its work and has handed off control to XGBoost. Unfortunately we don't have any visibility into what's going on internally.
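One generic way to see where a hung Python process is sitting is the standard library's faulthandler module, which can dump every thread's Python stack after a timeout. Below is a minimal sketch, not something dask-xgboost does itself. The names run_with_hang_dump and slow_training_stub are hypothetical, and the stub merely stands in for a blocking call such as training. Only Python-level frames are shown, so a hang inside native code would appear as the last Python call that entered it.

```python
import faulthandler
import sys
import tempfile
import time


def run_with_hang_dump(func, timeout_s, log_file=None):
    """Call func(); if it is still running after timeout_s seconds, write the
    Python stack of every thread to log_file. The call itself is not killed."""
    if log_file is None:
        log_file = sys.stderr  # must be a real file object with fileno()
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=log_file)
    try:
        return func()
    finally:
        # Cancel the pending dump once the call returns or raises.
        faulthandler.cancel_dump_traceback_later()


def slow_training_stub():
    """Hypothetical stand-in for a blocking training call."""
    time.sleep(0.5)
    return "done"


if __name__ == "__main__":
    with tempfile.TemporaryFile(mode="w+") as dump_file:
        print(run_with_hang_dump(slow_training_stub, 0.1, log_file=dump_file))
        dump_file.seek(0)
        print(dump_file.read())  # stack dump written while the stub was asleep
```

On a real worker the same wrapper could be placed around whichever Python function blocks, with a timeout long enough that a healthy run never triggers it.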
I have the same issue here using bioconda's xgboost 0.6.a2 and dask-xgboost from pip. Stuck at the same place. Where can I find the log for XGBoost?
@mrocklin What version of xgboost do you use? I was wondering whether the hang is caused by the newly released xgboost. Did you compile xgboost from source or install it from conda?
@mrocklin What version of xgboost do you use? I was wondering whether the hang is caused by the newly released xgboost. Did you compile xgboost from source or install it from conda?
Perhaps. I don't use xgboost regularly and so don't have a standard version. I probably used whatever was recent when I first wrote this code.
Running the test suite today after conda-installing xgboost from the conda-forge channel I find that dataframe tests pass but that the dask.array test segfaults. I don't experience the same behavior as above. I'll try to go over it again with newer libraries sometime, but can't promise that this will happen any time soon. Any help from others would be welcome here.
Ah, I did not notice that we have test code there. Great. I can play around with it and see what I can do.
PS: Can you also reopen this issue?
Thank you for your effort here @DigitalPig . Reopened
I found something very interesting while trying to figure out why it gets stuck. I ran the test code on my Ubuntu 16.10 machine with conda, using Python 3.6, xgboost from conda-forge, and dask-xgboost from GitHub. Everything seems to be fine. All tests pass, and training on the Titanic data with dask_xgboost also succeeded, apart from a complaint about the "hist" option in params.
But on my cluster, which is built on AWS EC2 RHEL7, I cannot pass test_numpy. My training on the Titanic data also got stuck there, with the same conda environment.
I am going to provision an Ubuntu cluster and see if I can reproduce the issue.
Indeed seeing tests have different behavior in the same conda environment is quite odd.
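When the same conda environment behaves differently on two machines, it can help to record what each machine actually reports and diff the results. The sketch below uses only the standard library; environment_snapshot and diff_snapshots are hypothetical helper names, and the package list to check is up to the reader. It does not explain the RHEL7 behavior, it only makes differences such as the libc version visible.

```python
import platform
from importlib import metadata


def environment_snapshot(packages):
    """Collect interpreter, OS, libc and package versions for this machine."""
    snapshot = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "libc": " ".join(platform.libc_ver()),
    }
    for name in packages:
        try:
            snapshot["pkg:" + name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            snapshot["pkg:" + name] = None  # not installed here
    return snapshot


def diff_snapshots(a, b):
    """Return {key: (value_in_a, value_in_b)} for every entry that differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    # Run this on each machine, save the output, and compare the two results.
    print(environment_snapshot(["xgboost", "dask-xgboost", "dask", "distributed"]))
```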
After provisioning an Ubuntu 14.04 cluster with the same setup, all tests pass. The toy Titanic example also runs through under both LocalCluster and a real cluster.
I think this may be due to an RHEL7 issue somewhere, although I am not sure where.
Also, it would be really great if we could grab output from xgboost during the process.
Also, it would be really great if we could grab output from xgboost during the process.
Presumably this is passing through stdout or the Python logging module? Historically Dask has relied on cluster managers to handle logs. For LocalCluster you can start with LocalCluster(silence_logs=False) to get output on stdout/stderr.
This comes up decently often enough that we might want to have some mechanism to stream logs back though. I'll ponder this, though it's unlikely to be solved immediately.
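Until something like that exists, one possible workaround is to capture output at the file-descriptor level inside the process that runs the native code, since a Python-level redirect does not see what a C or C++ library writes. This is a sketch under the assumption that the library writes to the process's stdout (fd 1) or stderr (fd 2), which this thread has not confirmed; capture_fd is a hypothetical helper, not part of Dask or XGBoost.

```python
import contextlib
import os
import sys
import tempfile


@contextlib.contextmanager
def capture_fd(fd, path):
    """Temporarily redirect file descriptor fd (1 = stdout, 2 = stderr) to the
    file at path, so output written by native code is captured as well."""
    sys.stdout.flush()
    sys.stderr.flush()
    saved_fd = os.dup(fd)  # remember where fd pointed originally
    with open(path, "wb") as target:
        os.dup2(target.fileno(), fd)
        try:
            yield
        finally:
            sys.stdout.flush()
            sys.stderr.flush()
            os.dup2(saved_fd, fd)  # restore the original destination
            os.close(saved_fd)


if __name__ == "__main__":
    log_path = os.path.join(tempfile.mkdtemp(), "native_output.log")
    with capture_fd(1, log_path):
        os.write(1, b"written straight to fd 1\n")  # bypasses sys.stdout
    with open(log_path, "rb") as fh:
        print(fh.read())
```

The captured file lives on whichever machine ran the code, so on a cluster it would still need to be read back from the worker.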
Thank you for tracking down the issue with system libraries by the way. Any thoughts on which dependency within Ubuntu 14.04 vs RHEL7 might be relevant here?
Not at this point... But I will spin up smaller clusters with the two distributions and dig a little bit more. Any findings on your side?
No finding from my side. To be honest I haven't looked into this problem much (there have been a few other things going on for me.) My apologies for not contributing here.
Hi there! I am having a very similar problem. Using docker containers on CentOS Linux release 7.3.1611 (Core), everything with dask/distributed seems to work fine (basic tests, dask grid search, joblib integration), but when I use dxgb.train for a very small training task, it never finishes. I see some changes on the Dask UI, but then it stops.
Interestingly enough, dxgb.train runs fine locally in my Windows docker environment, but not in the CentOS docker environment (distributed).
(I am using a docker image based on ogrisel/distributed.)
FROM ogrisel/distributed
RUN pip install --force-reinstall dask-xgboost
RUN conda install -y py-xgboost
RUN conda install -y seaborn
RUN conda install -y dask-searchcv -c conda-forge
I am encountering the same problem too.
import dask
import dask.dataframe as dd
from dask.distributed import Client
import dask_xgboost as dxgb
client = Client('192.168.50.211:8786')
client.restart()
df = dd.read_csv("adult_comp_cont", storage_options={'anon' : True})
df = df[:100]
df.columns = [str(i) for i in range(6)] + ['target']
Y = df['target']
X = df.drop('target', axis=1)
x, y = dask.persist(X, Y)
params = {'objective' :'binary:logistic', 'n_estimators' : 10, 'max_depth' : 3, 'learning_rate' : 0.033}
dxgb.train(client, params, x, y)
It's getting stuck indefinitely.
It's getting stuck indefinitely.
What's happening on the dashboard at this point? Are you sure the data has finished loading with the call to .persist?
Thanks for pointing out the mistake! I did this:
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster
import dask_xgboost as dxgb
lc = LocalCluster(processes=False, scheduler_port=8989)
client = Client(lc.scheduler_address)
df = dd.read_csv("adult_comp_cont", storage_options={'anon' : True})
df = df[:100]
df.columns = [str(i) for i in range(6)] + ['target']
Y = df['target']
X = df.drop('target', axis=1)
params = {'objective' :'binary:logistic', 'n_estimators' : 10, 'max_depth' : 3, 'learning_rate' : 0.033}
dxgb.train(client, params, X, Y)
and got this successfully:
[14:14:45] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=3
[14:14:45] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=3
[14:14:45] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=3
[14:14:45] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=3
[14:14:45] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=3
[14:14:45] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=3
[14:14:45] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 0 pruned nodes, max_depth=3
[14:14:45] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 0 pruned nodes, max_depth=3
[14:14:45] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=3
[14:14:45] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 0 pruned nodes, max_depth=3
The delay was caused by some conflict between the scheduler and the workers, so I used LocalCluster with processes=False.
I am encountering the same problem. I set up dask-xgboost on Kubernetes and a small training set went through smoothly. Then I tried the 8 GB HIGGS dataset and saw the same behavior, i.e. the job runs for a while before entering training, then it is stuck in training forever and CPU usage on all workers drops to between 0% and 2%. Logs are very limited.
Has anybody tried Dask + XGBoost on a large dataset that cannot fit in one machine's memory?
I am also facing the same issue. I am trying to run dask_xgboost with the GPU option enabled on a large dataset.
dask_xgboost works fine when the dataset is small. When I tried 10K, 100K, and 1M data points, it worked perfectly. When I increased it to 10M, it fails and the Dask dashboard stops responding. It fails before entering "train_part".
For reference, I am using the code below for this experiment.
import dask.dataframe as dd
import dask_xgboost as dxgb
# `client` is assumed to be an existing dask.distributed.Client connected to the cluster
df = dd.read_csv("large_dataset.csv")
y = df['target']
X = df.drop(columns=['target'])
params = {'objective': 'reg:linear', 'nround': 1000,
'max_depth': 16, 'eta': 0.01, 'subsample': 0.5,
'min_child_weight': 1, 'tree_method': 'gpu_hist'}
bst = dxgb.train(client, params, X, y)
FYI: I am working on an AWS EMR cluster. It can scale up to 21 nodes, each having a capacity of ~2.2GB.
Could you shed some light on this? Any suggestions to make this work would be appreciated.
@mrocklin @TomAugspurger
XGBoost expects the dataset to fit comfortably in memory. Perhaps the dataset is larger than RAM in the way that XGBoost stores it? I would look at the Dask dashboard and see if worker memory was getting high.
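A rough back-of-the-envelope check can be sketched as follows. The thread gives the row count (10M) and the cluster size (21 nodes of about 2.2 GB each, about 46 GB in total) but not the number of features, so the feature counts below are purely hypothetical. The sketch also assumes 4 bytes per value for a dense float32 layout and that only half of the cluster memory is usable for the data, both of which are assumptions rather than measured numbers. Real usage is typically higher because of intermediate copies.

```python
def dense_footprint_gb(n_rows, n_features, bytes_per_value=4):
    """Approximate size in GB of a dense numeric table (float32 by default)."""
    return n_rows * n_features * bytes_per_value / 1e9


def fits_in_cluster(n_rows, n_features, n_workers, worker_memory_gb,
                    usable_fraction=0.5):
    """True if the dense table fits in the assumed usable share of cluster RAM."""
    budget_gb = n_workers * worker_memory_gb * usable_fraction
    return dense_footprint_gb(n_rows, n_features) <= budget_gb


if __name__ == "__main__":
    # 10M rows on 21 workers of ~2.2 GB; feature counts are hypothetical.
    for n_features in (100, 1000):
        size = dense_footprint_gb(10_000_000, n_features)
        ok = fits_in_cluster(10_000_000, n_features, 21, 2.2)
        print(n_features, "features:", size, "GB, fits:", ok)
```

This only bounds the total. Each individual worker still has to hold its own partitions plus whatever copy XGBoost makes of them, so the worker memory plot on the dashboard remains the thing to watch.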
Setup:
dask 0.14 (pip installed)
xgboost 0.62 (conda installed)
dask-xgboost 0.10.X (modified distributed.comm.addressing so that import dask_xgboost loads without error, see https://github.com/dask/dask-xgboost/issues/1)
I was following the example here: https://gist.github.com/mrocklin/3696fe2398dc7152c66bf593a674e4d9
It produces the job, which looks like it runs for a few minutes.
However, some errors then occur, and my Python code neither finishes nor crashes.
I wish I could provide more logs.