dask / dask-xgboost

BSD 3-Clause "New" or "Revised" License

[bug] Dask worker dies during dask-xgboost classifier training : test_core.py::test_classifier #68

Open pradghos opened 4 years ago

pradghos commented 4 years ago

A Dask worker dies during dask-xgboost classifier training; it is observed while running test_core.py::test_classifier.

Configuration used -

Dask Version: 2.9.2
Distributed Version: 2.9.3
XGBoost Version: 0.90
Dask-XGBoost Version: 0.1.9
OS-release : 4.14.0-115.16.1.el7a.ppc64le
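
For completeness, the versions above can be collected with a short snippet like the one below (not part of the original report; it assumes each of these packages exposes a __version__ attribute):

import dask
import distributed
import xgboost
import dask_xgboost

# Print the version of each package involved in the failing run.
for mod in (dask, distributed, xgboost, dask_xgboost):
    print(f"{mod.__name__}: {mod.__version__}")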

Description / Steps -

  1. The test creates a cluster with two workers -
    
    > /mnt/pai/home/pradghos/dask-xgboost/dask_xgboost/tests/test_core.py(38)test_classifier()
    -> with cluster() as (s, [a, b]):
    (Pdb) n
    distributed.scheduler - INFO - Clear task state
    distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:45767
    distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:40743
    distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:40743
    distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:45767
    distributed.worker - INFO - -------------------------------------------------
    distributed.worker - INFO -               Threads:                          1
    distributed.worker - INFO -                Memory:                  612.37 GB
    distributed.worker - INFO -       Local Directory: /mnt/pai/home/pradghos/dask-xgboost/dask_xgboost/tests/_test_worker-c6ea91c7-746e-4c7a-9c13-f5afcd244966/worker-ebbqtfdu
    distributed.worker - INFO - -------------------------------------------------
    distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:33373
    distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:33373
    distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:45767
    distributed.worker - INFO - -------------------------------------------------
    distributed.worker - INFO -               Threads:                          1
    distributed.worker - INFO -                Memory:                  612.37 GB
    distributed.worker - INFO -       Local Directory: /mnt/pai/home/pradghos/dask-xgboost/dask_xgboost/tests/_test_worker-050815d2-54f6-4edc-9a03-dd075213449d/worker-i1yr8xvc
    distributed.worker - INFO - -------------------------------------------------
    distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:40743', name: tcp://127.0.0.1:40743, memory: 0, processing: 0>
    distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:40743
    distributed.core - INFO - Starting established connection
    distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:45767
    distributed.worker - INFO - -------------------------------------------------
    distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:33373', name: tcp://127.0.0.1:33373, memory: 0, processing: 0>
    distributed.core - INFO - Starting established connection
    distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:33373
    distributed.core - INFO - Starting established connection
    distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:45767
    distributed.worker - INFO - -------------------------------------------------
    distributed.core - INFO - Starting established connection

2. After a couple of steps, fit is called for the dask-xgboost classifier -

-> a.fit(X2, y2)
(Pdb) distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
n
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373

distributed.worker - DEBUG - Execute key: array-original-8d35e675b41aad38dc334c7f79ea1982 worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: array-original-8d35e675b41aad38dc334c7f79ea1982, {'op': 'task-finished', 'status': 'OK', 'nbytes': 80, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2651937, 'stop': 1580372953.265216, 'thread': 140735736705456, 'key': 'array-original-8d35e675b41aad38dc334c7f79ea1982'}
distributed.worker - DEBUG - Execute key: ('array-8d35e675b41aad38dc334c7f79ea1982', 0) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('array-8d35e675b41aad38dc334c7f79ea1982', 0), {'op': 'task-finished', 'status': 'OK', 'nbytes': 40, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2696354, 'stop': 1580372953.2696435, 'thread': 140735736705456, 'key': "('array-8d35e675b41aad38dc334c7f79ea1982', 0)"}
distributed.worker - DEBUG - Execute key: ('array-8d35e675b41aad38dc334c7f79ea1982', 1) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('array-8d35e675b41aad38dc334c7f79ea1982', 1), {'op': 'task-finished', 'status': 'OK', 'nbytes': 40, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2705007, 'stop': 1580372953.2705073, 'thread': 140735736705456, 'key': "('array-8d35e675b41aad38dc334c7f79ea1982', 1)"}
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Execute key: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 0) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 0), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2753158, 'stop': 1580372953.275466, 'thread': 140735736705456, 'key': "('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 0)"}
distributed.worker - DEBUG - Execute key: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 1) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 1), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2762377, 'stop': 1580372953.2763371, 'thread': 140735736705456, 'key': "('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 1)"}
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Execute key: ('getitem-a6b7823aa95705e499984f972c2b58b3', 0) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('getitem-a6b7823aa95705e499984f972c2b58b3', 0), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2805014, 'stop': 1580372953.2805073, 'thread': 140735736705456, 'key': "('getitem-a6b7823aa95705e499984f972c2b58b3', 0)"}
distributed.worker - DEBUG - Execute key: ('getitem-a6b7823aa95705e499984f972c2b58b3', 1) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('getitem-a6b7823aa95705e499984f972c2b58b3', 1), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2813187, 'stop': 1580372953.2813244, 'thread': 140735736705456, 'key': "('getitem-a6b7823aa95705e499984f972c2b58b3', 1)"}
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys


Dask worker dies - 

distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
/mnt/pai/home/pradghos/anaconda3/envs/gdf37/lib/python3.7/site-packages/dask/dataframe/_compat.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm  # noqa: F401
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat skipped: channel busy
distributed.worker - DEBUG - Heartbeat skipped: channel busy
distributed.worker - INFO - Run out-of-band function 'start_tracker'
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
/mnt/pai/home/pradghos/anaconda3/envs/gdf37/lib/python3.7/site-packages/dask/dataframe/_compat.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm  # noqa: F401
/mnt/pai/home/pradghos/anaconda3/envs/gdf37/lib/python3.7/site-packages/dask/dataframe/_compat.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm  # noqa: F401
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:40743', name: tcp://127.0.0.1:40743, memory: 1, processing: 1>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:40743    ===========================>>> One worker dies
/mnt/pai/home/pradghos/anaconda3/envs/gdf37/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
distributed.worker - DEBUG - Execute key: train_part-e17e49e3769aaa4870dc8cc01a1e015e worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - future state: train_part-e17e49e3769aaa4870dc8cc01a1e015e - RUNNING    === One worker is running infinitely
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - future state: train_part-e17e49e3769aaa4870dc8cc01a1e015e - RUNNING
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373



It is not clear why the dask worker dies at that point.

Thanks!
Pradipta 
pradghos commented 4 years ago

If I remove the sparse package (coming from conda-forge) from my environment, the Dask worker works fine and finishes the task instead of dying.

Removing the sparse conda package -

conda remove sparse
Collecting package metadata (repodata.json): done
Solving environment: done

==> WARNING: A newer version of conda exists. <==
  current version: 4.7.12
  latest version: 4.8.1

Please update conda by running

    $ conda update -n base -c defaults conda

## Package Plan ##

  environment location: /mnt/pai/home/pradghos/anaconda3/envs/gdf37

  removed specs:
    - sparse

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2019.11.28         |           py37_0         156 KB
    ------------------------------------------------------------
                                           Total:         156 KB

The following packages will be REMOVED:

  llvmlite-0.31.0-py37hd408876_0
  numba-0.47.0-py37h962f231_0
  sparse-0.9.1-py_0
  tbb-2019.9-h1bb5118_1

The following packages will be UPDATED:

  openssl            conda-forge::openssl-1.1.1d-h6eb9509_0 --> pkgs/main::openssl-1.1.1d-h7b6447c_3

The following packages will be SUPERSEDED by a higher-priority channel:

  certifi                                       conda-forge --> pkgs/main

Proceed ([y]/n)? y

Downloading and Extracting Packages
certifi-2019.11.28   | 156 KB    | ################################################################################################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
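
Before re-running the test, a quick way to double-check which of the relevant packages are still importable in the environment (a suggested sanity check, not something from the original log) is:

import importlib.util

# Report which of the relevant packages are still importable in this environment.
for pkg in ("sparse", "numba", "xgboost", "dask_xgboost"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'installed' if found else 'not installed'}")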

Then the success log -

(gdf37) [pradghos@dlw11 tests]$ pytest --trace -v test_core.py::test_classifier
==================================================================== test session starts =====================================================================
platform linux -- Python 3.7.6, pytest-5.3.4, py-1.8.1, pluggy-0.13.1 -- /mnt/pai/home/pradghos/anaconda3/envs/gdf37/bin/python
cachedir: .pytest_cache
rootdir: /mnt/pai/home/pradghos/dask-xgboost, inifile: setup.cfg
collected 1 item

test_core.py::test_classifier
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> PDB runcall (IO-capturing turned off) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> /mnt/pai/home/pradghos/dask-xgboost/dask_xgboost/tests/test_core.py(38)test_classifier()
-> with cluster() as (s, [a, b]):
(Pdb) n
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:46179
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:34459
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:34459
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:46179
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                  612.37 GB
distributed.worker - INFO -       Local Directory: /mnt/pai/home/pradghos/dask-xgboost/dask_xgboost/tests/_test_worker-071bff45-4e7d-4cb5-ae4d-5d77ec15ef20/worker-ozjlqw1m
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:33495
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:33495
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:46179
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                  612.37 GB
distributed.worker - INFO -       Local Directory: /mnt/pai/home/pradghos/dask-xgboost/dask_xgboost/tests/_test_worker-f8007b01-22e5-4a6e-b100-a4efbade1d80/worker-cib_tomi
distributed.worker - INFO - -------------------------------------------------

fit log -

-> a.fit(X2, y2)
(Pdb) distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:34459
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:34459
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:34459
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33495
n
distributed.worker - DEBUG - Execute key: array-original-8d35e675b41aad38dc334c7f79ea1982 worker: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Send compute response to scheduler: array-original-8d35e675b41aad38dc334c7f79ea1982, {'op': 'task-finished', 'status': 'OK', 'nbytes': 80, 'type': <class 'numpy.ndarray'>, 'start': 1580374654.0739253, 'stop': 1580374654.0739493, 'thread': 140735091896752, 'key': 'array-original-8d35e675b41aad38dc334c7f79ea1982'}
distributed.worker - DEBUG - Execute key: ('array-8d35e675b41aad38dc334c7f79ea1982', 0) worker: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Send compute response to scheduler: ('array-8d35e675b41aad38dc334c7f79ea1982', 0), {'op': 'task-finished', 'status': 'OK', 'nbytes': 40, 'type': <class 'numpy.ndarray'>, 'start': 1580374654.0785978, 'stop': 1580374654.078607, 'thread': 140735091896752, 'key': "('array-8d35e675b41aad38dc334c7f79ea1982', 0)"}
distributed.worker - DEBUG - Execute key: ('array-8d35e675b41aad38dc334c7f79ea1982', 1) worker: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Send compute response to scheduler: ('array-8d35e675b41aad38dc334c7f79ea1982', 1), {'op': 'task-finished', 'status': 'OK', 'nbytes': 40, 'type': <class 'numpy.ndarray'>, 'start': 1580374654.0801446, 'stop': 1580374654.080152, 'thread': 140735091896752, 'key': "('array-8d35e675b41aad38dc334c7f79ea1982', 1)"}
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Execute key: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 0) worker: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Send compute response to scheduler: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 0), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580374654.0842004, 'stop': 1580374654.0843685, 'thread': 140735091896752, 'key': "('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 0)"}
distributed.worker - DEBUG - Execute key: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 1) worker: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Send compute response to scheduler: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 1), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580374654.0857737, 'stop': 1580374654.0858817, 'thread': 140735091896752, 'key': "('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 1)"}
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Execute key: ('getitem-a6b7823aa95705e499984f972c2b58b3', 0) worker: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Send compute response to scheduler: ('getitem-a6b7823aa95705e499984f972c2b58b3', 0), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580374654.088994, 'stop': 1580374654.089002, 'thread': 140735091896752, 'key': "('getitem-a6b7823aa95705e499984f972c2b58b3', 0)"}
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Execute key: ('getitem-a6b7823aa95705e499984f972c2b58b3', 1) worker: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Send compute response to scheduler: ('getitem-a6b7823aa95705e499984f972c2b58b3', 1), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580374654.0944228, 'stop': 1580374654.0944307, 'thread': 140735091896752, 'key': "('getitem-a6b7823aa95705e499984f972c2b58b3', 1)"}
...
...
/mnt/pai/home/pradghos/anaconda3/envs/gdf37/lib/python3.7/site-packages/dask/dataframe/_compat.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm  # noqa: F401
distributed.worker - DEBUG - Heartbeat skipped: channel busy
distributed.worker - DEBUG - Heartbeat skipped: channel busy
distributed.worker - INFO - Run out-of-band function 'start_tracker'
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
/mnt/pai/home/pradghos/anaconda3/envs/gdf37/lib/python3.7/site-packages/dask/dataframe/_compat.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm  # noqa: F401
/mnt/pai/home/pradghos/anaconda3/envs/gdf37/lib/python3.7/site-packages/dask/dataframe/_compat.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm  # noqa: F401
distributed.worker - DEBUG - Execute key: train_part-7f615b0486ae0a04c5aadb6e5c529bb8 worker: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Execute key: train_part-140acf4f99cbae1677f5d995d3ac0e1e worker: tcp://127.0.0.1:34459
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:34459
[02:57:35] WARNING: /opt/anaconda/conda-bld/xgboost-base_1579835034723/work/src/learner.cc:622: Tree method is automatically selected to be 'approx' for distributed training.
[02:57:35] WARNING: /opt/anaconda/conda-bld/xgboost-base_1579835034723/work/src/learner.cc:622: Tree method is automatically selected to be 'approx' for distributed training.

[02:57:35] Tree method is automatically selected to be 'approx' for distributed training.
[02:57:35] Tree method is automatically selected to be 'approx' for distributed training.

distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:34459
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:34459
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33495
distributed.worker - DEBUG - future state: train_part-7f615b0486ae0a04c5aadb6e5c529bb8 - RUNNING
distributed.worker - DEBUG - future state: train_part-140acf4f99cbae1677f5d995d3ac0e1e - RUNNING
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:34459
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:34459
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33495
distributed.worker - DEBUG - future state: train_part-7f615b0486ae0a04c5aadb6e5c529bb8 - RUNNING
distributed.worker - DEBUG - future state: train_part-140acf4f99cbae1677f5d995d3ac0e1e - RUNNING
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:34459
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:34459
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33495
distributed.worker - DEBUG - future state: train_part-7f615b0486ae0a04c5aadb6e5c529bb8 - RUNNING
distributed.worker - DEBUG - future state: train_part-140acf4f99cbae1677f5d995d3ac0e1e - RUNNING
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:34459
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:34459
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33495
distributed.worker - DEBUG - future state: train_part-7f615b0486ae0a04c5aadb6e5c529bb8 - RUNNING
distributed.worker - DEBUG - future state: train_part-140acf4f99cbae1677f5d995d3ac0e1e - RUNNING

The train_part portion of the distributed XGBoost workload runs fine on both workers.
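
While the fit is running, one way to confirm that each worker has picked up a train_part task (a suggested check, not taken from the log above; run from a separate Python process, since fit() blocks the original session, and with the scheduler address taken from the log) is:

from distributed import Client

# Connect a second client to the already-running scheduler from the log above.
client = Client("tcp://127.0.0.1:46179")

# Client.processing() maps each worker address to the task keys it is executing.
for worker, tasks in client.processing().items():
    print(worker, [t for t in tasks if "train_part" in t])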

pradghos commented 4 years ago

Any pointers on whether the sparse package from conda-forge is incompatible with dask-xgboost or xgboost, or what else could be the reason behind the dask worker dying? It would really help!

TomAugspurger commented 4 years ago

Can you provide a minimal example? http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

I'm not sure why the presence of sparse would matter. We do import it, and it imports numba, but only use it when passed a sparse array.


pradghos commented 4 years ago

It is easily reproducible with the test case we already have in dask-xgboost: pytest -v test_core.py::test_classifier

Code snippet -

def test_classifier(loop):  # noqa
    with cluster() as (s, [a, b]):
        with Client(s["address"], loop=loop):
            a = dxgb.XGBClassifier()
            X2 = da.from_array(X, 5)
            y2 = da.from_array(y, 5)
            a.fit(X2, y2)  # ====> It hangs here.
            p1 = a.predict(X2)

    b = xgb.XGBClassifier()
    b.fit(X, y)
    np.testing.assert_array_almost_equal(
        a.feature_importances_, b.feature_importances_
    )
    assert_eq(p1, b.predict(X))
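
A standalone version of this reproduction outside the pytest harness would look roughly like the following (a sketch only; X and y below are small synthetic arrays rather than the fixtures defined in test_core.py):

import numpy as np
import dask.array as da
from distributed import Client, LocalCluster
import dask_xgboost as dxgb

# Small synthetic binary-classification data standing in for the test fixtures.
X = np.random.random((10, 2))
y = np.random.randint(0, 2, 10)

with LocalCluster(n_workers=2, threads_per_worker=1) as local_cluster:
    with Client(local_cluster) as client:
        clf = dxgb.XGBClassifier()
        X2 = da.from_array(X, 5)
        y2 = da.from_array(y, 5)
        clf.fit(X2, y2)  # hangs here when sparse (conda-forge) is installed
        print(clf.predict(X2).compute())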

As I mentioned earlier, whenever the sparse conda package is present, the hang is observed because one dask worker dies.

Please let me know if you need any other information.

Thanks!

TomAugspurger commented 4 years ago

Do you also have scipy installed?

The sparse code is fairly self-contained, just https://github.com/dask/dask-xgboost/blob/master/dask_xgboost/core.py#L17-L22 and https://github.com/dask/dask-xgboost/blob/master/dask_xgboost/core.py#L50-L63. Are you able to step through those and see where things go wrong?
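
For reference, the code at those two spots is roughly the following (a paraphrased sketch, not copied verbatim from core.py): an optional import, and a concat helper that only touches sparse when the partitions actually are sparse arrays.

import numpy as np
import pandas as pd

try:
    import sparse
    import scipy.sparse as ss
except ImportError:
    # sparse/scipy are optional; fall back to plain numpy/pandas handling.
    sparse = False
    ss = False


def concat(L):
    # Concatenate the per-partition results, dispatching on the partition type.
    if isinstance(L[0], np.ndarray):
        return np.concatenate(L, axis=0)
    elif isinstance(L[0], (pd.DataFrame, pd.Series)):
        return pd.concat(L, axis=0)
    elif ss and isinstance(L[0], ss.spmatrix):
        return ss.vstack(L, format="csr")
    elif sparse and isinstance(L[0], sparse.SparseArray):
        return sparse.concatenate(L, axis=0)
    else:
        raise TypeError("Data must be either numpy arrays or pandas dataframes. Got %s" % type(L[0]))

Note that the import itself (which pulls in numba via sparse) runs on every worker even when only dense numpy arrays are used, which would be consistent with the hang appearing as soon as sparse is installed.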