dask / dask-xgboost

[QST] test_numpy() fail with "rabit::Init is already called in this thread" #47

Open ksangeek opened 5 years ago

ksangeek commented 5 years ago

I am using dask-xgboost 0.1.7 with xgboost 0.82. test_core.py::test_numpy was failing for me, so I looked into the failure; my understanding so far is below. I am a bit bemused, as these tests were passing for me last week and, AFAIR, with the same package versions! I need some help understanding what is going on here.

  1. test_core.py::test_numpy failed with "rabit::Init is already called in this thread". These are the details from pdb:
$ pytest test_core.py::test_numpy
====================================== test session starts =======================================
platform linux -- Python 3.6.8, pytest-4.6.2, py-1.8.0, pluggy-0.12.0
rootdir: ./tests
plugins: cov-2.7.1, forked-1.0.2, xdist-1.28.0
collected 1 item

test_core.py
>>>>>>>>>>>>>>>>>>>>>>>>>>>> PDB set_trace (IO-capturing turned off) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> ./tests/test_core.py(200)test_numpy()
-> dX = da.from_array(X, chunks=(2, 2))
(Pdb) n
> ./tests/test_core.py(201)test_numpy()
-> dy = da.from_array(y, chunks=(2,))
(Pdb)
> ./tests/test_core.py(202)test_numpy()
-> dbst = yield dxgb.train(c, param, dX, dy)
(Pdb)
[08:42:34] Tree method is automatically selected to be 'approx' for distributed training.[08:42:34
] Tree method is automatically selected to be 'approx' for distributed training.

> ./tests/test_core.py(203)test_numpy()
-> dbst = yield dxgb.train(c, param, dX, dy)  # we can do this twice
(Pdb)
[08:42:38] Tree method is automatically selected to be 'approx' for distributed training.[08:42:38
] Tree method is automatically selected to be 'approx' for distributed training.

> ./tests/test_core.py(205)test_numpy()
-> predictions = dxgb.predict(c, dbst, dX)
(Pdb)
rabit::Init is already called in this thread
  2. On seeing the comment # workaround for "Doing rabit call after Finalize" in the test case, I attempted to fix it with:

    
    @@ -179,6 +179,7 @@ def test_dmatrix_kwargs(c, s, a, b):

     def _test_container(dbst, predictions, X_type):
    +    xgb.rabit.init()  # workaround for "Doing rabit call after Finalize"
         dtrain = xgb.DMatrix(X_type(X), label=y)
         bst = xgb.train(param, dtrain)

    @@ -195,7 +196,6 @@ def _test_container(dbst, predictions, X_type):

     @gen_cluster(client=True, timeout=None, check_new_threads=False)
     def test_numpy(c, s, a, b):
    -    xgb.rabit.init()  # workaround for "Doing rabit call after Finalize"
         dX = da.from_array(X, chunks=(2, 2))
         dy = da.from_array(y, chunks=(2,))
         dbst = yield dxgb.train(c, param, dX, dy)

and this particular test case then passed, but that does not fix the failures when the whole test suite runs. That still fails like this:

$ pytest
======================================================================================== test session starts =========================================================================================
platform linux -- Python 3.6.8, pytest-4.6.2, py-1.8.0, pluggy-0.12.0 -- ./anaconda3/envs/test-dask-xgb/bin/python
cachedir: .pytest_cache
rootdir: ./sandbox/dask-xgboost, inifile: setup.cfg
plugins: cov-2.7.1, forked-1.0.2, xdist-1.28.0
[gw0] linux Python 3.6.8 cwd: ./sandbox/dask-xgboost/dask_xgboost/tests
[gw0] Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:34:02)  -- [GCC 7.3.0]
gw0 [12]
scheduling tests via LoadScheduling
[gw0] [  8%] PASSED test_core.py::test_basic
[gw0] [ 16%] PASSED test_core.py::test_dmatrix_kwargs
[gw0] [ 25%] FAILED test_core.py::test_numpy
[gw0] [ 33%] FAILED test_core.py::test_scipy_sparse
[gw0] [ 41%] FAILED test_core.py::test_sparse
[gw0] [ 50%] PASSED test_core.py::test_errors
[gw0] [ 58%] FAILED test_core.py::test_classifier
[gw0] [ 66%] FAILED test_core.py::test_multiclass_classifier
[gw0] [ 75%] FAILED test_core.py::test_classifier_multi[array]
[gw0] [ 83%] FAILED test_core.py::test_classifier_multi[dataframe]
[gw0] [ 91%] FAILED test_core.py::test_regressor
[gw0] [100%] FAILED test_core.py::test_synchronous_api
./anaconda3/envs/test-dask-xgb/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 6 leaked semaphores to clean up at shutdown
..
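
One way to read the two error messages quoted above is as two sides of a per-thread init/finalize state. The sketch below is only a toy model of that idea, not rabit's actual implementation: ToyEngine, ToyEngineError and outcome are hypothetical names, and the messages are shortened stand-ins. Under that assumption it shows why an extra init() call helps after a finalize but fails when the same thread is already initialized, which is the kind of difference that can depend on which thread and in which order the tests run.

```python
import threading


class ToyEngineError(RuntimeError):
    """Raised by the toy model; stands in for the messages quoted above."""


class ToyEngine:
    """Hypothetical per-thread init/finalize tracker (not rabit's real code)."""

    def __init__(self):
        self._local = threading.local()

    def _state(self):
        # Every thread starts in the "new" state until it calls init().
        return getattr(self._local, "state", "new")

    def init(self):
        if self._state() == "initialized":
            raise ToyEngineError("Init is already called in this thread")
        self._local.state = "initialized"

    def finalize(self):
        self._local.state = "finalized"

    def call(self):
        if self._state() == "finalized":
            raise ToyEngineError("Doing call after Finalize")
        return "ok"


def outcome(fn):
    """Return fn()'s result, or the error message if the toy engine raises."""
    try:
        return fn()
    except ToyEngineError as exc:
        return str(exc)


engine = ToyEngine()

# A train-like step initializes and then finalizes in this thread.
engine.init()
engine.finalize()
after_finalize = outcome(engine.call)   # fails: the thread is finalized

# The workaround: initialize again before doing further calls.
engine.init()
after_reinit = outcome(engine.call)     # works again

# Applying the workaround when the thread is already initialized fails.
second_init = outcome(engine.init)

# A different thread has its own state, so init() succeeds there.
other_thread = []
worker = threading.Thread(target=lambda: other_thread.append(outcome(engine.init)))
worker.start()
worker.join()
```

In this model the workaround is only safe when the calling thread is known to be in the finalized state, so whether it helps depends on what ran earlier in the same thread.
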
TomAugspurger commented 5 years ago

I've never really understood the issue unfortunately. I tried to fix this upstream in xgboost, but didn't get too far with it: https://github.com/dmlc/xgboost/issues/2796

ksangeek commented 5 years ago

@TomAugspurger Thanks for the link to your attempt. Going by the comments in https://github.com/dask/dask-xgboost/issues/39#issuecomment-503338836, I expect that the work being done on low-level integration of dask into xgboost will not suffer from this issue.