dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

[jvm-packages] Checkpointing performance issue in XGBoost4J-Spark #3946

Open alois-bissuel opened 5 years ago

alois-bissuel commented 5 years ago

I just noticed a weird performance issue when using checkpoints in the Spark wrapper for XGBoost. Before the first checkpoint, the learning time is constant per tree, but after the first checkpoint, the time needed to learn a new tree is much higher, and it grows with the number of trees.

Here is the parameter map used by XGBoost:

```
Map(alpha -> 0.0, min_child_weight -> 1.0, sample_type -> uniform, base_score -> 0.5,
    colsample_bylevel -> 1.0, grow_policy -> depthwise, skip_drop -> 0.0, lambda_bias -> 0.0,
    silent -> 0, scale_pos_weight -> 1.0, seed -> 0, features_col -> features,
    num_early_stopping_rounds -> 0, label_col -> label, num_workers -> 5, subsample -> 1.0,
    lambda -> 1.0, max_depth -> 5, probability_col -> probability,
    raw_prediction_col -> rawPrediction, tree_limit -> 0, custom_eval -> null, rate_drop -> 0.0,
    max_bin -> 16, train_test_ratio -> 1.0, use_external_memory -> false,
    objective -> binary:logistic, eval_metric -> logloss, num_round -> 500,
    timeout_request_workers -> 1800000, missing -> NaN,
    checkpoint_path -> viewfs://root/user/XXX/YYYY/checkpoint, tracker_conf -> TrackerConf(0,python),
    tree_method -> auto, max_delta_step -> 0.0, eta -> 0.3, colsample_bytree -> 1.0,
    normalize_type -> tree, custom_obj -> null, gamma -> 0.0, sketch_eps -> 0.03, nthread -> 4,
    prediction_col -> prediction, checkpoint_interval -> 100)
```

Attached is a small graph showing this issue on the same dataset with different checkpoint intervals. Please disregard the discrepancy between the two runs, as they may not have been launched with the same multithreading parameters.

(image: per-tree learning time for the two runs with different checkpoint intervals)

A quick profiling (using async-profiler) before and after the checkpoint might give a hint to the problem. Before the checkpoint, I see many calls to tree::CQHistMaker or tree::GlobalProposalHistMaker, but after the checkpoint, the C++ calls are all to CPUPredictor::PredLoopSpecialize. I do not know whether this means that the method used for learning is not the same after the checkpoint, or that evaluating the trees takes so long that the profiler only sees that. I could upload some flamegraphs, but I would first need to make sure they are actually representative of what is going on.
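One way to see why prediction could start to dominate and grow with tree count: if per-row prediction results are cached, each boosting round only has to evaluate the newest tree, but if that cache is lost, round t must re-evaluate all t earlier trees, so total prediction work becomes quadratic. A toy cost model (illustrative Scala, not xgboost code) makes the difference concrete:

```scala
// Illustrative cost model (not xgboost code): tree evaluations needed for
// T boosting rounds, with and without a warm prediction cache.
object CacheCostModel {
  // Warm cache: round t evaluates only the newly added tree -> T total.
  def cachedWork(numTrees: Int): Int = numTrees

  // Cold cache every round: round t re-evaluates all t trees built so far,
  // i.e. 1 + 2 + ... + T = T(T+1)/2 total.
  def uncachedWork(numTrees: Int): Int = (1 to numTrees).sum

  def main(args: Array[String]): Unit = {
    println(s"500 trees, cached:   ${cachedWork(500)} tree evaluations")
    println(s"500 trees, uncached: ${uncachedWork(500)} tree evaluations")
  }
}
```

For 500 rounds this is 500 versus 125,250 tree evaluations per data point, which would match a per-tree time that keeps growing after the cache is lost.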

chenqin commented 5 years ago

IMO, there is an unnecessary reshuffle after each checkpoint. Is there any info on memory utilization?

CodingCat commented 5 years ago

@cq not true, the extra shuffle happens only every 100 iterations in this case.

alois-bissuel commented 5 years ago

@chenqin and @CodingCat , do you need any other "benchmark" ?

chenqin commented 5 years ago

Okay, I think I can add some info to this issue from my side, using a different experimental dataset.

2B rows, 26 features (including some numerical ones), doing regression with max-depth-20 trees and an MAE evaluator, checkpointing every 10 iterations. The experiment runs on 80 Spark executors with 8 vcores and 220 GB each (spark.task.cpus = 4, numThread = 3, numWorkers = 160).

[2] shows the per-iteration time getting slower, from around 6 minutes to 13 minutes. What's also interesting is that, at the same time, driver-side memory pressure keeps increasing after every checkpoint (green line in [1]).

[1] (screenshot, 2018-12-20: driver-side memory utilization over time)

[2]

```
[06:38:22] Tree method is automatically selected to be 'approx' for distributed training.
[06:38:22] Tree method is automatically selected to be 'approx' for distributed training.
[06:44:24] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 993336 extra nodes, 0 pruned nodes, max_depth=20
[06:44:24] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 993336 extra nodes, 0 pruned nodes, max_depth=20
[06:50:58] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1018830 extra nodes, 0 pruned nodes, max_depth=20
[06:50:58] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1018830 extra nodes, 0 pruned nodes, max_depth=20
[06:57:19] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1046772 extra nodes, 0 pruned nodes, max_depth=20
[06:57:19] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1046772 extra nodes, 0 pruned nodes, max_depth=20
[07:03:52] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1045244 extra nodes, 0 pruned nodes, max_depth=20
[07:03:52] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1045244 extra nodes, 0 pruned nodes, max_depth=20
[07:10:50] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1051044 extra nodes, 0 pruned nodes, max_depth=20
[07:10:50] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1051044 extra nodes, 0 pruned nodes, max_depth=20
[07:17:57] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1056348 extra nodes, 0 pruned nodes, max_depth=20
[07:17:57] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1056348 extra nodes, 0 pruned nodes, max_depth=20
[07:25:04] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1034246 extra nodes, 0 pruned nodes, max_depth=20
[07:25:05] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1034246 extra nodes, 0 pruned nodes, max_depth=20
[07:31:58] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 996690 extra nodes, 0 pruned nodes, max_depth=20
[07:31:58] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 996690 extra nodes, 0 pruned nodes, max_depth=20
[07:39:27] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1030580 extra nodes, 0 pruned nodes, max_depth=20
[07:39:27] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1030580 extra nodes, 0 pruned nodes, max_depth=20
[07:47:02] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 985686 extra nodes, 0 pruned nodes, max_depth=20
[07:47:02] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 985686 extra nodes, 0 pruned nodes, max_depth=20
[07:50:04] Tree method is automatically selected to be 'approx' for distributed training.
[07:50:04] Tree method is automatically selected to be 'approx' for distributed training.
[07:56:54] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1002140 extra nodes, 0 pruned nodes, max_depth=20
[07:56:55] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1002140 extra nodes, 0 pruned nodes, max_depth=20
[08:05:26] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 983358 extra nodes, 0 pruned nodes, max_depth=20
[08:05:27] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 983358 extra nodes, 0 pruned nodes, max_depth=20
[08:13:05] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 939796 extra nodes, 0 pruned nodes, max_depth=20
[08:13:05] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 939796 extra nodes, 0 pruned nodes, max_depth=20
[08:21:13] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 947240 extra nodes, 0 pruned nodes, max_depth=20
[08:21:14] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 947240 extra nodes, 0 pruned nodes, max_depth=20
[08:29:11] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 918900 extra nodes, 0 pruned nodes, max_depth=20
[08:29:11] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 918900 extra nodes, 0 pruned nodes, max_depth=20
[08:37:31] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 886702 extra nodes, 0 pruned nodes, max_depth=20
[08:37:31] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 886702 extra nodes, 0 pruned nodes, max_depth=20
[08:45:49] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 863450 extra nodes, 0 pruned nodes, max_depth=20
[08:45:49] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 863450 extra nodes, 0 pruned nodes, max_depth=20
[08:54:24] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 890440 extra nodes, 0 pruned nodes, max_depth=20
[08:54:24] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 890440 extra nodes, 0 pruned nodes, max_depth=20
[09:02:45] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 887092 extra nodes, 0 pruned nodes, max_depth=20
[09:02:45] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 887092 extra nodes, 0 pruned nodes, max_depth=20
[09:11:24] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 826454 extra nodes, 0 pruned nodes, max_depth=20
[09:11:24] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 826454 extra nodes, 0 pruned nodes, max_depth=20
[09:15:29] Tree method is automatically selected to be 'approx' for distributed training.
[09:15:29] Tree method is automatically selected to be 'approx' for distributed training.
[09:22:45] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 813594 extra nodes, 0 pruned nodes, max_depth=20
[09:22:46] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 813594 extra nodes, 0 pruned nodes, max_depth=20
[09:31:36] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 838276 extra nodes, 0 pruned nodes, max_depth=20
[09:31:36] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 838276 extra nodes, 0 pruned nodes, max_depth=20
[09:40:20] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 789644 extra nodes, 0 pruned nodes, max_depth=20
[09:40:20] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 789644 extra nodes, 0 pruned nodes, max_depth=20
[09:49:41] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 814340 extra nodes, 0 pruned nodes, max_depth=20
[09:49:41] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 814340 extra nodes, 0 pruned nodes, max_depth=20
[09:58:57] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 745168 extra nodes, 0 pruned nodes, max_depth=20
[09:58:57] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 745168 extra nodes, 0 pruned nodes, max_depth=20
[10:08:28] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 799576 extra nodes, 0 pruned nodes, max_depth=20
[10:08:28] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 799576 extra nodes, 0 pruned nodes, max_depth=20
[10:17:42] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 729830 extra nodes, 0 pruned nodes, max_depth=20
[10:17:42] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 729830 extra nodes, 0 pruned nodes, max_depth=20
[10:26:59] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 782810 extra nodes, 0 pruned nodes, max_depth=20
[10:27:00] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 782810 extra nodes, 0 pruned nodes, max_depth=20
[10:36:22] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 727454 extra nodes, 0 pruned nodes, max_depth=20
[10:36:22] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 727454 extra nodes, 0 pruned nodes, max_depth=20
[10:45:59] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 740738 extra nodes, 0 pruned nodes, max_depth=20
[10:46:00] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 740738 extra nodes, 0 pruned nodes, max_depth=20
[10:51:01] Tree method is automatically selected to be 'approx' for distributed training.
[10:51:01] Tree method is automatically selected to be 'approx' for distributed training.
[10:58:09] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 752464 extra nodes, 0 pruned nodes, max_depth=20
[10:58:09] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 752464 extra nodes, 0 pruned nodes, max_depth=20
[11:08:21] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 665380 extra nodes, 0 pruned nodes, max_depth=20
[11:08:21] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 665380 extra nodes, 0 pruned nodes, max_depth=20
[11:18:48] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 685176 extra nodes, 0 pruned nodes, max_depth=20
[11:18:48] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 685176 extra nodes, 0 pruned nodes, max_depth=20
[11:28:48] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 643578 extra nodes, 0 pruned nodes, max_depth=20
[11:28:48] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 643578 extra nodes, 0 pruned nodes, max_depth=20
[11:39:08] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 649784 extra nodes, 0 pruned nodes, max_depth=20
[11:39:08] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 649784 extra nodes, 0 pruned nodes, max_depth=20
[11:49:47] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 656458 extra nodes, 0 pruned nodes, max_depth=20
[11:49:47] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 656458 extra nodes, 0 pruned nodes, max_depth=20
[12:00:18] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 668894 extra nodes, 0 pruned nodes, max_depth=20
[12:00:18] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 668894 extra nodes, 0 pruned nodes, max_depth=20
[12:11:21] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 620048 extra nodes, 0 pruned nodes, max_depth=20
[12:11:21] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 620048 extra nodes, 0 pruned nodes, max_depth=20
[12:22:19] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 659902 extra nodes, 0 pruned nodes, max_depth=20
[12:22:19] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 659902 extra nodes, 0 pruned nodes, max_depth=20
[12:33:10] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 580146 extra nodes, 0 pruned nodes, max_depth=20
[12:33:11] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 580146 extra nodes, 0 pruned nodes, max_depth=20
[12:39:07] Tree method is automatically selected to be 'approx' for distributed training.
[12:39:07] Tree method is automatically selected to be 'approx' for distributed training.
[12:47:01] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 466660 extra nodes, 0 pruned nodes, max_depth=20
[12:47:02] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 466660 extra nodes, 0 pruned nodes, max_depth=20
[12:58:27] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 530932 extra nodes, 0 pruned nodes, max_depth=20
[12:58:27] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 530932 extra nodes, 0 pruned nodes, max_depth=20
[13:09:31] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 434816 extra nodes, 0 pruned nodes, max_depth=20
[13:09:32] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 434816 extra nodes, 0 pruned nodes, max_depth=20
[13:20:48] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 440214 extra nodes, 0 pruned nodes, max_depth=20
[13:20:48] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 440214 extra nodes, 0 pruned nodes, max_depth=20
[13:32:11] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 499660 extra nodes, 0 pruned nodes, max_depth=20
[13:32:11] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 499660 extra nodes, 0 pruned nodes, max_depth=20
[13:43:53] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 570746 extra nodes, 0 pruned nodes, max_depth=20
[13:43:53] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 570746 extra nodes, 0 pruned nodes, max_depth=20
[13:55:29] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 446036 extra nodes, 0 pruned nodes, max_depth=20
[13:55:29] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 446036 extra nodes, 0 pruned nodes, max_depth=20
[14:07:27] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 510108 extra nodes, 0 pruned nodes, max_depth=20
[14:07:27] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 510108 extra nodes, 0 pruned nodes, max_depth=20
[14:19:26] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 469160 extra nodes, 0 pruned nodes, max_depth=20
[14:19:26] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 469160 extra nodes, 0 pruned nodes, max_depth=20
[14:31:23] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 423070 extra nodes, 0 pruned nodes, max_depth=20
[14:31:23] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 423070 extra nodes, 0 pruned nodes, max_depth=20
[14:37:55] Tree method is automatically selected to be 'approx' for distributed training.
[14:37:55] Tree method is automatically selected to be 'approx' for distributed training.
[14:46:29] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 513778 extra nodes, 0 pruned nodes, max_depth=20
[14:46:30] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 513778 extra nodes, 0 pruned nodes, max_depth=20
[14:58:10] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 554968 extra nodes, 0 pruned nodes, max_depth=20
[14:58:10] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 554968 extra nodes, 0 pruned nodes, max_depth=20
[15:10:23] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 456476 extra nodes, 0 pruned nodes, max_depth=20
[15:10:23] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 456476 extra nodes, 0 pruned nodes, max_depth=20
[15:22:48] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 491384 extra nodes, 0 pruned nodes, max_depth=20
[15:22:49] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 491384 extra nodes, 0 pruned nodes, max_depth=20
[15:35:16] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 485296 extra nodes, 0 pruned nodes, max_depth=20
[15:35:16] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 485296 extra nodes, 0 pruned nodes, max_depth=20
[15:47:48] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 449818 extra nodes, 0 pruned nodes, max_depth=20
[15:47:48] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 449818 extra nodes, 0 pruned nodes, max_depth=20
[15:59:47] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 400004 extra nodes, 0 pruned nodes, max_depth=20
[15:59:47] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 400004 extra nodes, 0 pruned nodes, max_depth=20
[16:12:36] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 466620 extra nodes, 0 pruned nodes, max_depth=20
[16:12:37] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 466620 extra nodes, 0 pruned nodes, max_depth=20
[16:24:40] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 405946 extra nodes, 0 pruned nodes, max_depth=20
[16:24:40] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 405946 extra nodes, 0 pruned nodes, max_depth=20
[16:36:46] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 402508 extra nodes, 0 pruned nodes, max_depth=20
[16:36:46] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 402508 extra nodes, 0 pruned nodes, max_depth=20
[16:44:00] Tree method is automatically selected to be 'approx' for distributed training.
[16:44:00] Tree method is automatically selected to be 'approx' for distributed training.
[16:52:05] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 341324 extra nodes, 0 pruned nodes, max_depth=20
[16:52:06] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 341324 extra nodes, 0 pruned nodes, max_depth=20
[17:05:00] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 358984 extra nodes, 0 pruned nodes, max_depth=20
[17:05:00] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 358984 extra nodes, 0 pruned nodes, max_depth=20
[17:18:03] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 357420 extra nodes, 0 pruned nodes, max_depth=20
[17:18:03] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 357420 extra nodes, 0 pruned nodes, max_depth=20
[17:30:44] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 341556 extra nodes, 0 pruned nodes, max_depth=20
[17:30:44] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 341556 extra nodes, 0 pruned nodes, max_depth=20
[17:43:55] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 427186 extra nodes, 0 pruned nodes, max_depth=20
[17:43:55] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 427186 extra nodes, 0 pruned nodes, max_depth=20
[17:56:50] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 292752 extra nodes, 0 pruned nodes, max_depth=20
[17:56:50] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 292752 extra nodes, 0 pruned nodes, max_depth=20
[18:10:05] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 365336 extra nodes, 0 pruned nodes, max_depth=20
[18:10:05] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 365336 extra nodes, 0 pruned nodes, max_depth=20
[18:23:11] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 258840 extra nodes, 0 pruned nodes, max_depth=20
[18:23:12] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 258840 extra nodes, 0 pruned nodes, max_depth=20
[18:36:24] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 334262 extra nodes, 0 pruned nodes, max_depth=20
[18:36:24] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 334262 extra nodes, 0 pruned nodes, max_depth=20
[18:49:39] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 314544 extra nodes, 0 pruned nodes, max_depth=20
[18:49:39] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 314544 extra nodes, 0 pruned nodes, max_depth=20
[18:57:22] Tree method is automatically selected to be 'approx' for distributed training.
[18:57:23] Tree method is automatically selected to be 'approx' for distributed training.
[19:06:08] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 318820 extra nodes, 0 pruned nodes, max_depth=20
[19:06:08] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 318820 extra nodes, 0 pruned nodes, max_depth=20
[19:19:11] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 273578 extra nodes, 0 pruned nodes, max_depth=20
[19:19:11] /home/cq/xgboost4j-spark-scalability/xgboost_upstream/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 273578 extra nodes, 0 pruned nodes, max_depth=20
```

AnthonyTruchet commented 5 years ago

Hello, has this issue raised any interest? We are willing to investigate and contribute, and wanted to check first that no update or fix already exists. Is anyone watching or involved?

hcho3 commented 5 years ago

@CodingCat Is anyone in your org looking at this issue?

CodingCat commented 5 years ago

no, please go ahead, we do not have bandwidth to take care of it for now

bourbaki commented 5 years ago

I have started investigating the issue, and it looks like I found one possible cause. The prediction cache is not initialized after the first checkpoint (gbm_.predictor_.cache_), and if the cache has not been initialized, it will not be filled. So on every iteration, the predictions of all previous trees are recomputed. We lose the cache when propagating the booster to the driver. There are two possible solutions:

  1. Reinitialize the cache after starting from a checkpoint. This solves part of the problem; the cost is that we need to recompute the cache on the first iteration after every checkpoint.
  2. Propagate the cache with the booster. I think there is some kind of issue with serialization of the C++ object. I am preparing a pull request.
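The mechanism described above can be sketched with a toy booster (hypothetical names, not the real xgboost internals): a warm margin cache means each round evaluates only the newest tree, while a dropped cache (as happens when the booster is round-tripped through the driver at a checkpoint) forces the next round to re-evaluate every earlier tree.

```scala
// Toy sketch of a per-row margin cache for boosting. Not xgboost code;
// names and structure are illustrative only.
class ToyBooster {
  private var trees: List[Double => Double] = Nil       // each tree maps x to a margin delta
  private var cache: Option[Map[Double, Double]] = None // cached summed margins per data point
  var treeEvaluations: Int = 0                          // instrumentation: counts tree calls

  def addTree(tree: Double => Double, data: Seq[Double]): Unit = {
    trees = trees :+ tree
    val base: Map[Double, Double] = cache.getOrElse {
      // Cold cache (e.g. right after restoring from a checkpoint without
      // cached predictions): re-evaluate every earlier tree on every row.
      data.map { x =>
        x -> trees.init.map { t => treeEvaluations += 1; t(x) }.sum
      }.toMap
    }
    // Warm path: only the newly added tree is evaluated per row.
    cache = Some(data.map { x =>
      treeEvaluations += 1
      x -> (base(x) + tree(x))
    }.toMap)
  }

  // What effectively happens at a checkpoint in the scenario described above.
  def dropCache(): Unit = cache = None
}
```

Calling dropCache() between rounds makes the next addTree pay for all previous trees again, which is exactly the per-iteration slowdown reported after each checkpoint.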

CodingCat commented 5 years ago

@bourbaki thanks for the analysis and I agree with it

regarding how to resolve it, I think those two solutions have their own problems as you said

  1. the first iteration right after the checkpoint would be very slow, and it becomes slower and slower as the number of trees grows

  2. the problem is we would have to change the serialization format of the booster, which may hurt the other bindings of xgboost; furthermore, the serialized booster could be huge if the cached prediction results happen to be very large

the ideal fix should be more fundamental (in how we do checkpointing, and it needs more native APIs)... I have some ideas and was about to start the work, but I just noticed you added "I am preparing a pull request" after I read your comments last time

please let me know your proposal before you submit the code, to avoid wasted effort; I am actually in a hurry to get this fixed due to an internally raised request

CodingCat commented 5 years ago

@bourbaki if you are dedicated to this, I am more than happy to assign this to you and work with you to get it fixed and merged

bourbaki commented 5 years ago

@CodingCat Yes. I would be glad to help you with it. I agree with all of the objections :). What approach would you suggest?

CodingCat commented 5 years ago

@bourbaki sorry for the late reply, really busy these days; my suggestion would be

  1. since checkpointing is a functionality built in the xgboost4j-spark layer, we can pursue a solution at this level,

  2. the point is to get the cache initialized with a booster loaded from a checkpoint, so I would suggest we

2.1 save the prediction results with Spark

2.2 expose an API to directly call Init Function in Predictor

2.3 when starting training from a checkpoint, load back the data saved in 2.1 (maybe we have to go through the HDFS API directly) and use an adapted version of PredLoopSpecalize to update the cache with the loaded Dataset

the most important thing we need to be careful about is guaranteeing deterministic partitioning

what do you think?
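As a rough illustration of steps 2.1-2.3 above (plain Python, not the actual xgboost4j-spark code; every name here is made up), the idea is to persist the per-row prediction margins alongside the booster at checkpoint time and, on restart, re-seed the prediction cache from the saved margins instead of re-evaluating every stored tree:

```python
import json
import os
import tempfile

def checkpoint(path, trees, margins):
    # 2.1: save the prediction results (margins) next to the booster state.
    with open(path, "w") as f:
        json.dump({"trees": trees, "margins": margins}, f)

def restore(path):
    # 2.2/2.3: load the saved state and re-seed the cache; the loaded
    # margins become the starting predictions, so no stored tree has to
    # be re-evaluated on the first iteration after the restart.
    with open(path) as f:
        state = json.load(f)
    return state["trees"], state["margins"]

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
checkpoint(path, trees=["t0", "t1"], margins=[0.3, -0.1, 0.7])
trees, cache = restore(path)
print(trees, cache)
```

Deterministic partitioning matters here because the saved margins must line up row-for-row with the partitions each worker sees after the restart.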

bourbaki commented 5 years ago

It sounds good, but there is one thing that bothers me about the current design of xgboost4j-spark. At every checkpoint we collect the booster from the executors to the driver to save it, but then, to start the next iteration, we copy the booster from the driver back to each executor instead of reusing the current state of the executors. These copies are redundant when there are no failures in the training process and we just want to save the current state of the booster without interrupting training. It would be cool to have something like this:

val boosters1: RDD[Booster] = train(..., iterations=100).cache
val boosterToSave = boosters1.first
boosterToSave.save(...)
val boosters2 = continueTrain(boosters1, iterations=100).cache
....

In this case, when training goes according to plan, we only copy the booster to the driver and then resume training from the cached state.
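A minimal stand-in for the proposed API (plain Python, no Spark; train/continue_train are hypothetical names mirroring the sketch above) showing the intended data flow, where the in-memory state is extended directly and the driver only copies the booster out for saving:

```python
def train(iterations):
    # Stand-in for building the first `iterations` trees.
    return list(range(iterations))

def continue_train(state, iterations):
    # Resume from the existing in-memory state: no copy back from the
    # driver, the state is extended in place with new trees.
    start = len(state)
    return state + list(range(start, start + iterations))

boosters1 = train(iterations=100)
booster_to_save = boosters1[-1]   # collected once, only for saving
boosters2 = continue_train(boosters1, iterations=100)
print(len(boosters2))             # 200
```

The copy to the driver happens only on the save path; the training path never round-trips the booster through the driver.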

CodingCat commented 5 years ago

> It sounds good, but there is one thing that bothers me about the current design of xgboost4j-spark. At every checkpoint we collect the booster from the executors to the driver to save it, but then, to start the next iteration, we copy the booster from the driver back to each executor instead of reusing the current state of the executors. These copies are redundant when there are no failures in the training process and we just want to save the current state of the booster without interrupting training. It would be cool to have something like this:
>
> val boosters1: RDD[Booster] = train(..., iterations=100).cache
> val boosterToSave = boosters1.first
> boosterToSave.save(...)
> val boosters2 = continueTrain(boosters1, iterations=100).cache
> ....
>
> In this case, when training goes according to plan, we only copy the booster to the driver and then resume training from the cached state.

yes, but in that way we need some way to pass a message from the driver to the executors saying "please move on"... leading us to more complexity in the messaging system for training...

if we want to pursue that approach, we need to extend the rabit layer to allow the tracker and workers to communicate more than "start/stop/I died/please reborn", etc.

CodingCat commented 5 years ago

but we would still need to save the prediction results to allow training to recover from a checkpoint... which does not reduce the complexity I talked about earlier

CodingCat commented 5 years ago

@bourbaki are you working on this?

CodingCat commented 5 years ago

ping @bourbaki

bourbaki commented 5 years ago

Sorry for the late reply. I have been busy for the last couple of weeks. I have started to work on it and am planning to spend some time on it this week.

CodingCat commented 5 years ago

@bourbaki are you still working on this? if you are too busy, I can take over it

bourbaki commented 5 years ago

@CodingCat Sorry, I am too busy right now. You can take over it.

trams commented 5 years ago

Hello @CodingCat. We have allocated significant time in July-August to fix this checkpointing performance issue in XGBoost4J-Spark (it is a blocker for us). So I am wondering what the current status of your work is, and whether you are preparing a pull request or whether I could take this over.

Thank you

CodingCat commented 5 years ago

I am currently working on it

trams commented 5 years ago

@CodingCat could you please share the status of your work and confirm the approach you decided to take?

This issue blocks us (Criteo), and we have allocated resources to fix it. Let us know how we can contribute, or how you can hand it over if you have higher priorities.

CodingCat commented 5 years ago

it’s still WIP on my side (half finished); I will share the design in the form of an RFC within this month (actually I have shared some of it in this thread) and the code later,

CodingCat commented 5 years ago

update:

basically, I have run the prototype fix with our production data and everything looks fine. Now it comes down to the cost of the fix. The fix under test takes the path where the prediction cache is rebuilt for each checkpoint interval, which still slows down the first iteration of each interval. However, a more fundamental fix requires adding or changing APIs, and the added APIs might be consumed only by the JVM package; we need to evaluate whether that is worth it (or whether we have a better design option)

let me start a new design discussion thread on this topic later

trams commented 5 years ago

Thank you for the update. I am glad you have a fix; from your description, I think I have a similar fix being tested on a production dataset too.

I am looking forward to you sharing your notes/patches so we can compare notes.

By the way, my fix is published as a branch in our fork; I can link it here if you are interested. I did not submit a pull request because it is not yet ready: there are some "unit tests" that only demonstrate the effect of checkpointing, and some not very clean logging.

trams commented 5 years ago

@CodingCat do you have any updates on your side?

I want to prepare an RFC with my ideas for potential fixes. What do you think?

hcho3 commented 4 years ago

~Can this be closed, now that #5082 is merged?~ Never mind, I had to look at #4786 roadmap.