dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

stuck at "foreachPartition at XGBoost.scala:565" #10795

Open fkjhaflkjgg opened 2 months ago

fkjhaflkjgg commented 2 months ago

I have hit this issue many times. The job sometimes hangs for more than 10 hours, while a normal run takes only 1 hour to succeed. It occurs with both small datasets (1 million+ samples) and large datasets (10 million+ samples).

XGBoostSpark: Running XGBoost 1.0.0 with parameters:
alpha -> 0.0
min_child_weight -> 300.0
sample_type -> uniform
base_score -> 0.5
weight_col ->
rabit_timeout -> -1
colsample_bylevel -> 1.0
grow_policy -> depthwise
skip_drop -> 0.0
lambda_bias -> 0.0
silent -> 0
scale_pos_weight -> 1.0
seed -> 0
cache_training_set -> false
features_col -> features
num_early_stopping_rounds -> 0
label_col -> label
num_workers -> 200
subsample -> 1.0
lambda -> 1.0
max_depth -> 6
probability_col -> probability
raw_prediction_col -> rawPrediction
tree_limit -> 0
custom_eval -> null
dmlc_worker_connect_retry -> 5
rate_drop -> 0.0
max_bin -> 16
train_test_ratio -> 1.0
use_external_memory -> false
objective -> binary:logistic
eval_metric -> auc
num_round -> 200
timeout_request_workers -> 1800000
missing -> 0.0
rabit_ring_reduce_threshold -> 32768
checkpoint_path ->
tracker_conf -> TrackerConf(0,python)
tree_method -> hist
max_delta_step -> 0.0
eta -> 0.15
verbosity -> 1
colsample_bytree -> 1.0
normalize_type -> tree
allow_non_zero_for_missing -> false
custom_obj -> null
gamma -> 0.0
sketch_eps -> 0.03
nthread -> 1
prediction_col -> prediction
checkpoint_interval -> -1
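For anyone comparing a stuck run against a successful one, it may help to turn the parameter dump above into a dictionary and diff the two runs. The following is only a rough sketch: `parse_params` and `diff_params` are hypothetical helper names, not part of XGBoost, and the parser simply assumes that every parameter name is a word followed by ` ->` and that empty values (such as `weight_col`) are allowed.

```python
import re

# A parameter name is assumed to be a word immediately followed by " ->".
KEY_RE = re.compile(r"(\w+) ->")


def parse_params(dump: str) -> dict:
    """Parse 'key -> value key -> value ...' into a dict.

    Text before the first key (for example the 'Running XGBoost ...' prefix)
    is ignored. A key with nothing after the arrow maps to an empty string.
    """
    matches = list(KEY_RE.finditer(dump))
    params = {}
    for i, m in enumerate(matches):
        # The value runs from the end of this arrow to the start of the next key.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(dump)
        params[m.group(1)] = dump[m.end():end].strip()
    return params


def diff_params(a: dict, b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every key whose value differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# A shortened excerpt of the dump from this issue.
stuck_run = parse_params(
    "XGBoostSpark: Running XGBoost 1.0.0 with parameters: "
    "weight_col -> rabit_timeout -> -1 num_workers -> 200 "
    "tree_method -> hist nthread -> 1"
)
# The same excerpt with tree_method changed, as tried later in this thread.
approx_run = dict(stuck_run, tree_method="approx")

print(diff_params(stuck_run, approx_run))
```

Diffing the parsed dictionaries makes it easier to confirm that `tree_method` (or any other setting) is the only thing that changed between two runs.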

fkjhaflkjgg commented 2 months ago

This looks like the same issue as #5013 (https://github.com/dmlc/xgboost/issues/5013).

fkjhaflkjgg commented 2 months ago

When I change tree_method to "approx", the application succeeds, but this may lose some precision.

fkjhaflkjgg commented 2 months ago

Compared with the logs of a successful application, the logs of the stuck application are missing these lines:
"24/09/02 12:04:09 INFO MemoryStore: Block rdd_43_0 stored as values in memory (estimated size 984.0 B, free 7.8 GB)
24/09/02 12:04:11 INFO Executor: 1 block locks were not released by TID = 3005: [rdd_43_0]"
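A comparison like this can be automated so that it does not depend on reading two long executor logs by eye. The sketch below is one possible approach, not an official tool: it assumes each log line starts with a `yy/MM/dd HH:mm:ss LEVEL Logger:` prefix as in the lines quoted above, masks all numbers, and reports message shapes that appear in the successful log but never in the stuck one. The function names and the `Example: placeholder` lines are made up for illustration; only the `MemoryStore` and `Executor` lines come from this thread.

```python
import re

# Assumed prefix: "yy/MM/dd HH:mm:ss LEVEL Logger: message".
LINE_RE = re.compile(r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} (\w+) ([\w.$]+): (.*)$")
NUMBER_RE = re.compile(r"\d+(\.\d+)?")


def template(line):
    """Reduce a log line to 'Logger: message' with numbers masked as <n>.

    Returns None when the line does not match the assumed prefix.
    """
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    _level, logger, message = m.groups()
    return f"{logger}: " + NUMBER_RE.sub("<n>", message)


def missing_templates(good_lines, stuck_lines):
    """List templates seen in good_lines but never in stuck_lines, in first-seen order."""
    stuck = {template(line) for line in stuck_lines}
    missing = []
    for line in good_lines:
        t = template(line)
        if t is not None and t not in stuck and t not in missing:
            missing.append(t)
    return missing


good_log = [
    "24/09/02 12:00:00 INFO Example: placeholder line 1",  # made-up filler line
    "24/09/02 12:04:09 INFO MemoryStore: Block rdd_43_0 stored as values in memory "
    "(estimated size 984.0 B, free 7.8 GB)",
    "24/09/02 12:04:11 INFO Executor: 1 block locks were not released by TID = 3005: [rdd_43_0]",
]
stuck_log = [
    "24/09/03 08:30:00 INFO Example: placeholder line 2",  # made-up filler line
]

for t in missing_templates(good_log, stuck_log):
    print(t)
```

Masking numbers means that timestamps, task IDs and RDD block names do not cause false differences, so what remains is the set of steps the stuck executor never reached.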

wbo4958 commented 2 months ago

@fkjhaflkjgg, could you try the latest XGBoost from https://mvnrepository.com/artifact/ml.dmlc ?