Closed HennethAnnun closed 4 years ago
Where in train_part does it fail? IIRC we run a function on the data to check if it's a sparse dtype. Can you try running just that function?
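The dtype check described above can be tried directly on a partition. A minimal sketch, assuming pandas; the helper name `check_partition` is hypothetical, not dask-xgboost's actual function:

```python
import numpy as np
import pandas as pd

def check_partition(df):
    # Hypothetical stand-in for the sparse-dtype check mentioned above:
    # report any sparse columns and whether the frame contains NaNs.
    sparse_cols = [c for c in df.columns
                   if isinstance(df[c].dtype, pd.SparseDtype)]
    has_na = bool(df.isna().any().any())
    return sparse_cols, has_na

df = pd.DataFrame({"a": np.arange(5, dtype="int64"),
                   "b": np.ones(5, dtype="int64")})
sparse_cols, has_na = check_partition(df)
print(sparse_cols, has_na)  # → [] False
```

Running this on a suspect partition (e.g. one pulled back with `.get_partition(i).compute()`) would confirm whether any column is secretly sparse or contains NaNs.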
The data is all int64, with no NaNs. How can I debug this / get the logs? Could you help me with this?
I don't recall if there's logging in this area. You might have to insert print statements / breakpoints.
Hm... that is tough :-)
Here: https://github.com/dask/dask-xgboost/blob/master/dask_xgboost/core.py, line 228 — I get no results at this point.
Here is my crafted DEBUG ;-) :
2020-03-23 19:35:38,701 - numba.cuda.cudadrv.driver - DEBUG - call driver api: cuCtxGetDevice
2020-03-23 19:35:39,125 - dask_xgboost.core - INFO - train function called
2020-03-23 19:35:39,126 - dask_xgboost.core - INFO - _train function called
2020-03-23 19:35:40,614 - dask_xgboost.core - INFO - defaultdict(<class 'list'>, {'tcp://127.0.0.1:39099': [<Future: finished, type: builtins.tuple, key: tuple-3a005762-d0b1-4de0-9f0b-e05aa9583929>]})
2020-03-23 19:35:40,616 - dask_xgboost.core - INFO - ncores {'tcp://127.0.0.1:39099': 1}
2020-03-23 19:35:40,616 - dask_xgboost.core - INFO - host 127.0.0.1 port 44763
2020-03-23 19:35:40,616 - dask_xgboost.core - INFO - 127.0.0.1
2020-03-23 19:35:40,620 - root - INFO - start listen on 127.0.0.1:9091
2020-03-23 19:35:40,621 - dask_xgboost.core - INFO - Starting Rabit Tracker dupa
2020-03-23 19:35:40,623 - dask_xgboost.core - INFO - {'DMLC_NUM_WORKER': 1, 'DMLC_TRACKER_URI': '127.0.0.1', 'DMLC_TRACKER_PORT': 9091}
2020-03-23 19:35:40,624 - dask_xgboost.core - INFO - -------- futures ----------
2020-03-23 19:35:40,624 - dask_xgboost.core - INFO - [<Future: pending, key: train_part-37f97f505017df6b7f2079767cafc8f0>]
2020-03-23 19:35:40,625 - dask_xgboost.core - INFO - -------- futures ----------
2020-03-23 19:35:40,625 - dask_xgboost.core - INFO - -------- worker_map -------
2020-03-23 19:35:40,625 - dask_xgboost.core - INFO - dict_items([('tcp://127.0.0.1:39099', [<Future: finished, type: builtins.tuple, key: tuple-3a005762-d0b1-4de0-9f0b-e05aa9583929>])])
2020-03-23 19:35:40,625 - dask_xgboost.core - INFO - -------- worker_map -------
2020-03-23 19:35:40,625 - dask_xgboost.core - INFO - train_part ------ <function train_part at 0x7fdf833c3598>
2020-03-23 19:35:40,625 - dask_xgboost.core - INFO - env ------------- {'DMLC_NUM_WORKER': 1, 'DMLC_TRACKER_URI': '127.0.0.1', 'DMLC_TRACKER_PORT': 9091}
2020-03-23 19:35:40,626 - dask_xgboost.core - INFO - worker -------- tcp://127.0.0.1:39099 ------- list_of_parts --------- [<Future: finished, type: builtins.tuple, key: tuple-3a005762-d0b1-4de0-9f0b-e05aa9583929>]
2020-03-23 19:35:40,626 - dask_xgboost.core - INFO - params -------------- {'silent': 0, 'max_depth': 4, 'grow_policy': 'lossguide', 'max_leaves': 256, 'tree_method': 'gpu_hist', 'num_class': 808, 'n_gpus': 1, 'eval_metric': 'mlogloss', 'objective': 'multi:softmax'}
2020-03-23 19:35:41,296 - root - DEBUG - Recieve start signal from 127.0.0.1; assign rank 0
2020-03-23 19:35:41,296 - root - INFO - @tracker All of 1 nodes getting started
What do you think?
Nothing stands out to me. What happens if you manually submit a train_part to a worker?
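A hedged sketch of that experiment, using a toy function in place of dask-xgboost's actual train_part and an in-process LocalCluster; the real test would submit train_part itself to the GPU worker's address:

```python
from distributed import Client, LocalCluster

def fake_train_part(list_of_parts):
    # Stand-in for dask_xgboost.core.train_part: just count the parts,
    # so we can confirm the task reaches the worker and returns.
    return len(list_of_parts)

cluster = LocalCluster(n_workers=1, processes=False, dashboard_address=None)
client = Client(cluster)

# Pin the task to a specific worker, the way _train does via its worker_map.
addr = list(client.scheduler_info()["workers"])[0]
future = client.submit(fake_train_part, [1, 2, 3], workers=[addr])
result = future.result()

client.close()
cluster.close()
print(result)  # → 3
```

If a hand-submitted train_part hangs the same way, that isolates the failure to the worker-side training code rather than the scheduling path.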
Hi Tom, thank you for your time. I stripped down train_part and manually ran the conversion from pandas to DMatrix, then just Rabit and XGBoost training on a single GPU. GPU memory filled up; it happened so fast I couldn't catch it. The same thing happened on the local cluster.
Debugging this is a challenge, but I learned a lot. Thx.
The Dask worker dies after calling the dask_xgboost.train method using GPUs (one or many). The same code runs fine on CPU.
Configuration used - RAPIDS docker image
cuDF Version: 0.12.0b+2032.gab97331
Dask Version: 2.10.1
Dask cuDF Version: 0.12.0b+2032.gab97331
Dask XGBoost Version: 0.1.5
numpy Version: 1.18.1
pandas Version: 0.25.3
Scikit-Learn Version: 0.22.1
Description / Steps
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://127.0.0.1:43901
distributed.scheduler - INFO - dashboard at: 127.0.0.1:8787
distributed.nanny - INFO - Start Nanny at: 'tcp://127.0.0.1:36881'
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:41279', name: 0, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:41279
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-b52dc8d4-6d07-11ea-b55f-0242ac110002
distributed.core - INFO - Starting established connection
Reading data from CSV: Pandas DF -> Dask DF -> Dask cuDF. Data processing on a single GPU / multiple GPUs works fine up to this point. Then:
distributed.worker - INFO - Run out-of-band function 'start_tracker'
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:41279', name: 0, memory: 12, processing: 1>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:41279
distributed.scheduler - INFO - Lost all workers
distributed.nanny - INFO - Worker process 13746 was killed by unknown signal
distributed.nanny - WARNING - Restarting worker
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:33803', name: 0, memory: 0, processing: 8>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:33803
distributed.core - INFO - Starting established connection
The train_part task shows status "no-worker" and just hangs there.