dask / dask-xgboost

BSD 3-Clause "New" or "Revised" License

Dask worker dies during dask-xgboost classifier - no sparse package, fails only on GPU #71

Closed HennethAnnun closed 4 years ago

HennethAnnun commented 4 years ago

Dask worker dies after calling the dask_xgboost.train method when using GPUs (one or many). The same code runs fine on CPU.

Configuration used - RAPIDS docker image
cuDF Version: 0.12.0b+2032.gab97331
Dask Version: 2.10.1
Dask cuDF Version: 0.12.0b+2032.gab97331
Dask XGBoost Version: 0.1.5
numpy Version: 1.18.1
pandas Version: 0.25.3
Scikit-Learn Version: 0.22.1

Description / Steps

# Local CUDA Cluster
import cudf
from dask.distributed import Client
from dask.distributed import performance_report
from dask_cuda import LocalCUDACluster

# create a local CUDA cluster
cluster = LocalCUDACluster(silence_logs=False, n_workers=1)
client = Client(cluster)

# switch every worker to the CUDA managed-memory allocator
client.run(cudf.set_allocator, "managed")
client

distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://127.0.0.1:43901
distributed.scheduler - INFO - dashboard at: 127.0.0.1:8787
distributed.nanny - INFO - Start Nanny at: 'tcp://127.0.0.1:36881'
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:41279', name: 0, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:41279
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-b52dc8d4-6d07-11ea-b55f-0242ac110002
distributed.core - INFO - Starting established connection

Reading data from CSV: pandas DF -> Dask DF -> Dask cuDF. Up to this point, data processing on a single GPU / on multiple GPUs works fine.
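
The conversion chain, roughly (a minimal sketch; the CSV path, the number of partitions, and the "label" column name are placeholders):

import pandas as pd
import dask.dataframe as dd
import dask_cudf

# pandas DF -> Dask DF -> Dask cuDF
pdf = pd.read_csv("train.csv")              # placeholder path
ddf = dd.from_pandas(pdf, npartitions=4)    # CPU-backed Dask DataFrame
gdf = dask_cudf.from_dask_dataframe(ddf)    # move partitions onto the GPU

# split features / target; "label" is a placeholder column name
feature_cols = [c for c in gdf.columns if c != "label"]
X_train_dask_cudf = gdf[feature_cols]
y_train_dask_cudf = gdf["label"]

Then: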

# instantiate params
params = {}

# general params
general_params = {'silent': 0}
params.update(general_params)

# booster params
n_gpus = 1  
booster_params = {}
booster_params['max_depth'] = 4
booster_params['grow_policy'] = 'lossguide'
booster_params['max_leaves'] = 2**8
booster_params['tree_method'] = 'gpu_hist'
booster_params['num_class'] = n_categories   
booster_params['n_gpus'] = n_gpus
params.update(booster_params)

# learning task params
learning_task_params = {}
learning_task_params['eval_metric'] = 'mlogloss'
learning_task_params['objective'] = 'multi:softmax'
params.update(learning_task_params)

# model training settings
num_round = 100

# This was GPU training
bst = dask_xgboost.train(client, params, X_train_dask_cudf, y_train_dask_cudf, num_boost_round=num_round)

distributed.worker - INFO - Run out-of-band function 'start_tracker'
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:41279', name: 0, memory: 12, processing: 1>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:41279
distributed.scheduler - INFO - Lost all workers
distributed.nanny - INFO - Worker process 13746 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:33803', name: 0, memory: 0, processing: 8>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:33803
distributed.core - INFO - Starting established connection

The train_part task shows as "no-worker" and just hangs there.
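
For reference, the "no-worker" state can be confirmed from the scheduler side with something like this (a sketch using distributed's internals; attribute names may differ between versions):

# list the task keys the scheduler currently considers unrunnable ("no-worker");
# Scheduler.unrunnable is internal API, so check it against your distributed version
def unrunnable_keys(dask_scheduler):
    return [ts.key for ts in dask_scheduler.unrunnable]

print(client.run_on_scheduler(unrunnable_keys))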

TomAugspurger commented 4 years ago

Where in train_part does it fail? IIRC we run a function on the data to check if it's a sparse dtype. Can you try running just that function?
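
Something along these lines, just to rule that check out (a sketch; I don't remember the exact helper name in dask_xgboost.core, so this reproduces the dtype check by hand on one partition):

import pandas as pd

# pull one partition back to the client and inspect its dtypes;
# if merely materializing the partition already kills the worker,
# the problem is upstream of XGBoost entirely
part = X_train_dask_cudf.get_partition(0).compute()
print(part.dtypes)
print({col: pd.api.types.is_sparse(part[col].dtype) for col in part.columns})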

HennethAnnun commented 4 years ago

Data is all int64, no NaNs. How can I debug this / get the logs? Could you help me with this?

TomAugspurger commented 4 years ago

I don't recall if there's logging in this area. You might have to insert print statements / breakpoints.
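
If it helps, worker-side output can also be pulled back to the client after a run (a sketch; it assumes dask_xgboost.core uses the usual logging.getLogger(__name__) pattern and reuses the client from the session above):

import logging

# turn up the module-level logger (assumption: logging.getLogger(__name__) in core.py)
logging.getLogger("dask_xgboost.core").setLevel(logging.DEBUG)
logging.basicConfig(level=logging.INFO)

# after a failed run, fetch whatever the workers logged
for worker, records in client.get_worker_logs().items():
    print(worker)
    for level, message in records:
        print("  ", level, message)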

HennethAnnun commented 4 years ago

Hm... that is tough :-)

Here: https://github.com/dask/dask-xgboost/blob/master/dask_xgboost/core.py - at line 228 I never get any results back.

Here is my hand-crafted DEBUG output ;-) :

2020-03-23 19:35:38,701 - numba.cuda.cudadrv.driver - DEBUG - call driver api: cuCtxGetDevice
2020-03-23 19:35:39,125 - dask_xgboost.core - INFO - train function called
2020-03-23 19:35:39,126 - dask_xgboost.core - INFO - _train function called
2020-03-23 19:35:40,614 - dask_xgboost.core - INFO - defaultdict(<class 'list'>, {'tcp://127.0.0.1:39099': [<Future: finished, type: builtins.tuple, key: tuple-3a005762-d0b1-4de0-9f0b-e05aa9583929>]})
2020-03-23 19:35:40,616 - dask_xgboost.core - INFO - ncores {'tcp://127.0.0.1:39099': 1}
2020-03-23 19:35:40,616 - dask_xgboost.core - INFO - host 127.0.0.1 port 44763
2020-03-23 19:35:40,616 - dask_xgboost.core - INFO - 127.0.0.1
2020-03-23 19:35:40,620 - root - INFO - start listen on 127.0.0.1:9091
2020-03-23 19:35:40,621 - dask_xgboost.core - INFO - Starting Rabit Tracker dupa
2020-03-23 19:35:40,623 - dask_xgboost.core - INFO - {'DMLC_NUM_WORKER': 1, 'DMLC_TRACKER_URI': '127.0.0.1', 'DMLC_TRACKER_PORT': 9091}
2020-03-23 19:35:40,624 - dask_xgboost.core - INFO - -------- futures ----------
2020-03-23 19:35:40,624 - dask_xgboost.core - INFO - [<Future: pending, key: train_part-37f97f505017df6b7f2079767cafc8f0>]
2020-03-23 19:35:40,625 - dask_xgboost.core - INFO - -------- futures ----------
2020-03-23 19:35:40,625 - dask_xgboost.core - INFO - -------- worker_map -------
2020-03-23 19:35:40,625 - dask_xgboost.core - INFO - dict_items([('tcp://127.0.0.1:39099', [<Future: finished, type: builtins.tuple, key: tuple-3a005762-d0b1-4de0-9f0b-e05aa9583929>])])
2020-03-23 19:35:40,625 - dask_xgboost.core - INFO - -------- worker_map -------
2020-03-23 19:35:40,625 - dask_xgboost.core - INFO - train_part ------ <function train_part at 0x7fdf833c3598>
2020-03-23 19:35:40,625 - dask_xgboost.core - INFO - env ------------- {'DMLC_NUM_WORKER': 1, 'DMLC_TRACKER_URI': '127.0.0.1', 'DMLC_TRACKER_PORT': 9091}
2020-03-23 19:35:40,626 - dask_xgboost.core - INFO - worker -------- tcp://127.0.0.1:39099 ------- list_of_parts --------- [<Future: finished, type: builtins.tuple, key: tuple-3a005762-d0b1-4de0-9f0b-e05aa9583929>]
2020-03-23 19:35:40,626 - dask_xgboost.core - INFO - params -------------- {'silent': 0, 'max_depth': 4, 'grow_policy': 'lossguide', 'max_leaves': 256, 'tree_method': 'gpu_hist', 'num_class': 808, 'n_gpus': 1, 'eval_metric': 'mlogloss', 'objective': 'multi:softmax'}
2020-03-23 19:35:41,296 - root - DEBUG - Recieve start signal from 127.0.0.1; assign rank 0
2020-03-23 19:35:41,296 - root - INFO - @tracker All of 1 nodes getting started

What do you think?

TomAugspurger commented 4 years ago

Nothing stands out to me. What happens if you manually submit a train_part to a worker?
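
Roughly like this (just a sketch; double-check train_part's actual signature in dask_xgboost/core.py, I'm going by the values in your log above, and list_of_parts stands for the part futures that _train builds for that worker):

from dask_xgboost.core import train_part

# values taken from the tracker log above; train_part talks to Rabit, so the
# tracker needs to be reachable at this address or the call may block at init
env = {'DMLC_NUM_WORKER': 1,
       'DMLC_TRACKER_URI': '127.0.0.1',
       'DMLC_TRACKER_PORT': 9091}

# list_of_parts: the already-finished (data, labels) tuple future(s) from worker_map
fut = client.submit(train_part, env, params, list_of_parts,
                    num_boost_round=num_round,
                    workers=['tcp://127.0.0.1:39099'])  # pin to the worker holding the data
fut.result()  # see whether an exception comes back or the worker dies again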

HennethAnnun commented 4 years ago

Hi Tom, thank you for your time. I have stripped down train_part and manually ran the conversion from pandas to DMatrix, then just Rabit and the XGBoost training on a single GPU. GPU memory filled up. It happened so fast I could not catch it. The same happened on the local cluster.
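
Roughly what I ran by hand (a sketch; in the actual run I also set up Rabit first, and params / num_round are the same as above):

import xgboost as xgb

# pull one partition back, go through pandas, and build the DMatrix directly
X_part = X_train_dask_cudf.get_partition(0).compute().to_pandas()
y_part = y_train_dask_cudf.get_partition(0).compute().to_pandas()

dtrain = xgb.DMatrix(X_part, label=y_part)

# plain single-GPU training with the same params ('tree_method': 'gpu_hist', ...);
# this is where GPU memory filled up almost immediately
bst = xgb.train(params, dtrain, num_boost_round=num_round)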

Debugging this is a challenge, but I learned a lot. Thx.