lsli8888 opened this issue 1 month ago
We obtain the host address from the Spark barrier context (https://github.com/dmlc/xgboost/blob/778751a1bbc9a6ed539b54a36676351f1d185e63/python-package/xgboost/spark/utils.py#L57); I am not sure why the address cannot be assigned.
Hi @lsli8888, could you try 2.1.0, which has the new RABIT implementation?
Hi @wbo4958 - unfortunately, I got the same error. I used EMR Serverless 7.1.0/XGBoost 2.1.1 this time.
2024-08-05 15:04:28.150 [INFO] XGBoost-PySpark: Running xgboost-2.1.1 on 1 workers with
booster params: {'objective': 'reg:squarederror', 'device': 'cpu', 'eval_metric': 'rmse', 'learning_rate': 0.2, 'max_depth': 15, 'random_state': 0, 'subsample': 1.0, 'nthread': 2}
train_call_kwargs_params: {'early_stopping_rounds': 50, 'verbose_eval': True, 'num_boost_round': 1000}
dmatrix_kwargs: {'nthread': 2, 'missing': 0.0}
Traceback (most recent call last):
File "/tmp/spark-de893419-2160-4632-8c50-ce608e2f38fe/train_model.py", line 471, in <module>
main()
File "/home/hadoop/environment/lib64/python3.9/site-packages/my_project_utils/utils/logger_utils.py", line 38, in wrapper
return func(*args, **kwargs)
File "/home/hadoop/environment/lib64/python3.9/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/home/hadoop/environment/lib64/python3.9/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/hadoop/environment/lib64/python3.9/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/hadoop/environment/lib64/python3.9/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/tmp/spark-de893419-2160-4632-8c50-ce608e2f38fe/train_model.py", line 351, in main
train_model(
File "/home/hadoop/environment/lib64/python3.9/site-packages/my_project_utils/utils/logger_utils.py", line 55, in timed
result: Callable = method(*args, **kw)
File "/home/hadoop/environment/lib64/python3.9/site-packages/my_project_utils/utils/logger_utils.py", line 38, in wrapper
return func(*args, **kwargs)
File "/tmp/spark-de893419-2160-4632-8c50-ce608e2f38fe/train_model.py", line 441, in train_model
model_training_strategy.train_model(
File "/home/hadoop/environment/lib64/python3.9/site-packages/my_project/pyspark_utils/model_processor/xgboost_pipeline_model_training_strategy.py", line 40, in train_model
return model_training_pipeline.fit(my_dataframe), my_dataframe
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 205, in fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 134, in _fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 205, in fit
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/spark/core.py", line 1125, in _fit
(config, booster) = _run_job()
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/spark/core.py", line 1111, in _run_job
ret = rdd_with_resource.collect()[0]
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1833, in collect
File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(213, 0) finished unsuccessfully.
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/spark/core.py", line 1054, in _train_booster
_rabit_args = _get_rabit_args(context, num_workers)
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/spark/utils.py", line 69, in _get_rabit_args
env = _start_tracker(context, n_workers)
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/spark/utils.py", line 58, in _start_tracker
tracker = RabitTracker(n_workers=n_workers, host_ip=host, sortby="task")
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/tracker.py", line 73, in __init__
_check_call(_LIB.XGTrackerCreate(args, ctypes.byref(handle)))
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/core.py", line 284, in _check_call
raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [15:04:50] /workspace/src/collective/result.cc:78:
- [socket.h:89|15:04:50]: Failed to bind socket. system error:Cannot assign requested address
Stack trace:
[bt] (0) /home/hadoop/environment/lib64/python3.9/site-packages/xgboost/lib/libxgboost.so(+0x22dbbc) [0x7f57217b9bbc]
[bt] (1) /home/hadoop/environment/lib64/python3.9/site-packages/xgboost/lib/libxgboost.so(+0x2ce6c1) [0x7f572185a6c1]
[bt] (2) /home/hadoop/environment/lib64/python3.9/site-packages/xgboost/lib/libxgboost.so(+0x2db3a9) [0x7f57218673a9]
[bt] (3) /home/hadoop/environment/lib64/python3.9/site-packages/xgboost/lib/libxgboost.so(XGTrackerCreate+0x6b0) [0x7f57217eb550]
[bt] (4) /lib64/libffi.so.8(+0x7d16) [0x7f57451b5d16]
[bt] (5) /lib64/libffi.so.8(+0x456e) [0x7f57451b256e]
[bt] (6) /lib64/libffi.so.8(ffi_call+0x123) [0x7f57451b5373]
[bt] (7) /usr/lib64/python3.9/lib-dynload/_ctypes.cpython-39-x86_64-linux-gnu.so(+0x8e5d) [0x7f57451c2e5d]
[bt] (8) /usr/lib64/python3.9/lib-dynload/_ctypes.cpython-39-x86_64-linux-gnu.so(+0x8742) [0x7f57451c2742]
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:572)
at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:118)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:525)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.ContextAwareIterator.hasNext(ContextAwareIterator.scala:39)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:86)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:80)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:322)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:751)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:451)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1962)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:282)
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3067)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3003)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3002)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:3002)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:2326)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3265)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3205)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3194)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1041)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2406)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2427)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2446)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2471)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1046)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:407)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1045)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:195)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:840)
2024-08-05 15:04:50.852 [INFO] py4j.clientserver: Closing down clientserver connection
I have no idea whether the port RABIT tried to bind to was unavailable, or whether the process lacked sufficient permissions.
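For what it's worth, the error string in the trace ("Failed to bind socket. system error:Cannot assign requested address", i.e. EADDRNOTAVAIL) usually means the tracker tried to bind to an IP address that is not assigned to any interface on that machine, rather than a permissions or port-conflict problem. A minimal sketch that reproduces the distinction (203.0.113.9 is from a reserved documentation range and is assumed not to be configured on any local interface):

```python
import errno
import socket

def try_bind(host_ip: str, port: int = 0) -> str:
    """Try to bind a TCP socket, roughly as a tracker would.

    Returns "ok" on success, or the errno name on failure.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host_ip, port))
        return "ok"
    except OSError as e:
        return errno.errorcode.get(e.errno, str(e.errno))
    finally:
        s.close()

print(try_bind("127.0.0.1"))    # a locally assigned address binds fine
print(try_bind("203.0.113.9"))  # an address not on any local interface fails
```

Binding to a non-local IP fails with EADDRNOTAVAIL, whereas a port that is taken would fail with EADDRINUSE and a privileged port would fail with EACCES, so the errno narrows down which of the scenarios above applies.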
Thanks @wbo4958. I've asked the same question in the AWS re:Post site - link here so I hope someone can provide some help.
Hey folks,
Just wanted to share a quick fix I found for a similar issue. I noticed RabitTracker was using spark.driver.host to bind to. By default, this is set to the local hostname, but in my case, it was pointing to a different IP.
The solution: I changed spark.driver.host to my local IP address, and it worked!
Quick thought: Wouldn't it make more sense for RabitTracker to use spark.driver.bindAddress instead? Just throwing it out there.
Hope this helps someone!
Hi @sgonullu - thanks for the info! What Python code did you use to get the local IP address? Something like this?
socket.gethostbyname(socket.gethostname())
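As an aside, socket.gethostbyname(socket.gethostname()) can return a loopback or otherwise non-routable address depending on how /etc/hosts is set up, which is exactly the kind of mismatch this thread is about. A commonly used alternative sketch (the 8.8.8.8 target is arbitrary; a connected UDP socket sends no packets, it only asks the OS which source address it would route from):

```python
import socket

def local_ip() -> str:
    """Best-effort local IP: ask the OS which source address it would
    use for outbound traffic; fall back to hostname resolution."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(("8.8.8.8", 80))  # no traffic is sent; just selects a route
        return s.getsockname()[0]
    except OSError:
        try:
            return socket.gethostbyname(socket.gethostname())
        except OSError:
            return "127.0.0.1"  # last-resort fallback on isolated hosts
    finally:
        s.close()

print(local_ip())
```

This is only a sketch of a workaround for picking the value to pass as spark.driver.host; whether that is the right fix for the tracker is discussed further below in the thread.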
Hi @sgonullu, currently the RabitTracker is launched on the executor side, and we get the IP address from the TaskContext info provided by Spark. It looks like the XGBoost PySpark integration does not use spark.driver.bindAddress or spark.driver.host.
Hi @lsli8888, @trivialfis, I once had a PR to launch the RabitTracker on the driver side (https://github.com/dmlc/xgboost/pull/10281); I don't know whether it helps in this case.
@wbo4958 I definitely got farther in my XGBoost training job on EMR Serverless, but it eventually fails with the following:
24/08/23 02:14:48 INFO ExecutorContainerAllocator: Going to request 8 executors for ResourceProfile Id: 0, target: 8 already provisioned: 0.
24/08/23 02:14:48 INFO DefaultEmrServerlessRMClient: Creating containers with container role SPARK_EXECUTOR and keys: Set(67, 66, 68, 72, 65, 71, 69, 70)
24/08/23 02:14:48 INFO DefaultEmrServerlessRMClient: Containers created with container role SPARK_EXECUTOR. key to container id map: Map(72 -> a8c8be81-ea98-f69e-73ae-6196e31cb572, 66 -> 36c8be81-ea98-b22f-dc37-d9b5912f0849, 69 -> 0ac8be81-eabc-8726-454f-71bd9fc78ee7, 71 -> 8ec8be81-eab6-b6f4-e57c-b60faa55494a, 68 -> 64c8be81-ea98-b44e-086a-46064575efa5, 65 -> 1cc8be81-eaa0-4037-c5ac-9561f7d5511f, 70 -> 6ac8be81-ea98-c6a3-69eb-750181e47089, 67 -> bcc8be81-eab9-160e-76c9-8e66c481d641)
24/08/23 02:14:53 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/08/23 02:15:08 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/08/23 02:15:23 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/08/23 02:15:38 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/08/23 02:15:53 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/08/23 02:16:08 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/08/23 02:16:23 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/08/23 02:16:38 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/08/23 02:16:48 WARN ExecutorContainerStoreImpl: Executor(s) 66,70,67,71,68,72,69,65 exited unexpectedly with exit code 1.
24/08/23 02:16:49 INFO ExecutorContainerAllocator: Going to request 8 executors for ResourceProfile Id: 0, target: 8 already provisioned: 0.
24/08/23 02:16:49 INFO DefaultEmrServerlessRMClient: Creating containers with container role SPARK_EXECUTOR and keys: Set(73, 78, 77, 79, 76, 80, 75, 74)
24/08/23 02:16:49 INFO DefaultEmrServerlessRMClient: Containers created with container role SPARK_EXECUTOR. key to container id map: Map(78 -> 6ec8be82-d562-e587-3a10-0e67f81810d1, 74 -> d2c8be82-d562-1c0e-73d4-ecf320291621, 80 -> dcc8be82-d561-f74a-35d1-02d07ae3dd71, 77 -> ecc8be82-d558-04a7-56e9-0bfc209399c6, 79 -> eac8be82-d558-7826-0a21-82a35cd22aea, 73 -> 0ac8be82-d558-0e2c-5365-14b762c38e6e, 76 -> a4c8be82-d561-05bf-cfe2-bed04501ad17, 75 -> 12c8be82-d558-62a8-a26f-ca8f6884daff)
24/08/23 02:16:53 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/08/23 02:17:08 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/08/23 02:17:23 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/08/23 02:17:38 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/08/23 02:17:53 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/08/23 02:18:08 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/08/23 02:18:18 WARN ExecutorContainerStoreImpl: Executor(s) 73,77 exited unexpectedly with exit code 1.
24/08/23 02:18:19 INFO ExecutorContainerAllocator: Going to request 2 executors for ResourceProfile Id: 0, target: 8 already provisioned: 6.
24/08/23 02:18:19 INFO DefaultEmrServerlessRMClient: Creating containers with container role SPARK_EXECUTOR and keys: Set(81, 82)
24/08/23 02:18:19 INFO DefaultEmrServerlessRMClient: Containers created with container role SPARK_EXECUTOR. key to container id map: Map(82 -> dec8be83-855a-e8c7-847a-62639d6ff0ad, 81 -> b2c8be83-855a-8e30-92bb-636e720e5829)
24/08/23 02:18:23 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/08/23 02:18:38 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/08/23 02:18:48 WARN ExecutorContainerStoreImpl: Executor(s) 78,74,75,79,76,80 exited unexpectedly with exit code 1.
24/08/23 02:18:49 INFO ExecutorContainerAllocator: Going to request 6 executors for ResourceProfile Id: 0, target: 8 already provisioned: 2.
24/08/23 02:18:49 INFO DefaultEmrServerlessRMClient: Creating containers with container role SPARK_EXECUTOR and keys: Set(84, 88, 83, 87, 86, 85)
24/08/23 02:18:49 INFO DefaultEmrServerlessRMClient: Containers created with container role SPARK_EXECUTOR. key to container id map: Map(84 -> 2ec8be83-c019-85b4-f9b3-2992972587e5, 87 -> 00c8be83-c020-aa9a-ac04-e93496fb13da, 86 -> 02c8be83-c019-a9dd-b655-05c739cf4e95, 83 -> dec8be83-c022-5139-87fe-c30139c90c34, 85 -> 58c8be83-c020-e199-977c-37d94e14055c, 88 -> 98c8be83-c019-c53f-b2c1-1e3aaeda0fa0)
24/08/23 02:18:53 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/08/23 02:19:08 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/08/23 02:19:23 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/08/23 02:19:38 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/08/23 02:19:53 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/08/23 02:20:08 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/08/23 02:20:18 ERROR EmrServerlessClusterSchedulerBackend: Abandoning job due to no executor being launched within 1200000ms after driver starts.
These warnings/errors repeat quite a few times. I have a maximum of 256 concurrent vCPUs for my account, and I can see usage ramp up to at least 56% of the available vCPUs.
I have the following settings for EMR Serverless as well:
spark_driver_memory = "16g"
spark_emr_serverless_driver_disk = "50g"
spark_emr_serverless_executor_disk = "50g"
spark_emr_serverless_memoryoverheadfactor = 0.1
spark_executor_cores = 16
spark_executor_memory = "16g"
spark_executor_instances = 8
spark_dynamic_allocation_enabled = "true"
spark_dynamic_allocation_max_executors = 24
spark_task_cpus = 2
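Assuming those settings map to the usual Spark configs (spark.executor.cores, spark.task.cpus, spark.executor.instances), a quick sanity check on the task-slot arithmetic suggests capacity alone should not be the bottleneck, since the barrier stage only needs one slot per XGBoost worker:

```python
# Values taken from the configuration listed above.
executor_cores = 16  # spark.executor.cores
task_cpus = 2        # spark.task.cpus
executors = 8        # spark.executor.instances

tasks_per_executor = executor_cores // task_cpus
total_slots = executors * tasks_per_executor
print(tasks_per_executor, total_slots)  # 8 slots per executor, 64 total
```

With 64 concurrent task slots available and only 1 XGBoost worker requested, the "Initial job has not accepted any resources" warnings point at executors failing to register at all, not at undersized executors.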
I would assume a maximum of 24 executors at 16 GB RAM each is enough? Or do I need to allocate even more?
Just for comparison: I am able to train on my M1 Mac, which has 16 GB of RAM, with a sampling rate of 0.02. The training run above uses a sampling rate of 0.1, so I think I have more than enough Spark executors.
Hi @sgonullu - thanks for the info! What Python code did you use to get the local IP address? Something like this?
socket.gethostbyname(socket.gethostname())
I apologize for the confusion earlier. To clarify, I was troubleshooting the issue and submitted a test Spark job with --master local[*], which resulted in the same bind error. However, when I changed the spark.driver.host setting, it worked with --master local[*]. I assumed that the tracker would always start on the driver.
After @wbo4958 mentioned that the tracker starts on the executor, I reviewed the code to understand how the IP address is determined. I tested with the following code snippet:
from pyspark import BarrierTaskContext

def get_addr_list(iterator):
    context = BarrierTaskContext.get()
    addr_list = [info.address for info in context.getTaskInfos()]
    yield addr_list

executors_addr = rdd.barrier().mapPartitions(get_addr_list).collect()
print(executors_addr)
Interestingly, when using --master local[*], the IP address returned was the spark.driver.host value. However, when using --master yarn, the IP address returned was not the expected worker IP address, but rather a network address (e.g. 10.42.0.0; I am not on EMR) that corresponds to my client's network. Note that my client is running in a container and is not on the same network as the executors.
I'm unsure whether this is due to a Spark configuration issue or something else. Any further guidance would be appreciated.
@lsli8888, from the logs, the executors exit with an unexpected error when Spark tries to create them.
24/08/23 02:16:48 WARN ExecutorContainerStoreImpl: Executor(s) 66,70,67,71,68,72,69,65 exited unexpectedly with exit code 1.
You should look into the reason there. To debug this kind of issue, I would suggest:
1. Check whether the job works with only 1 executor.
2. Reduce the dataset to a few rows and see whether XGBoost can train on that single executor.
Hi @sgonullu, can XGBoost work in the --master yarn scenario? The BarrierTaskContext of each Spark task should hold the IP address of each task, which is assigned by the Spark driver, so the IP should be accessible from the driver.
@wbo4958 Good thoughts - I forgot to check the Spark executor logs; I only checked the driver logs. I tried it with a single Spark driver and a single Spark executor (the maximum was also 1 executor), 16 GB of RAM each, and a sampling rate of 0.0001, which is 1000 records. Unfortunately, I got the same error, but here are the Spark executor logs indicating a connection error. 10.0.176.163 is the spark.driver.host value that I got by invoking socket.gethostbyname(socket.gethostname()). Should I be setting spark.executor.host as well?
24/08/26 07:09:17 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 36@ip-10-0-142-117.us-east-2.compute.internal
24/08/26 07:09:17 INFO SignalUtils: Registering signal handler for TERM
24/08/26 07:09:17 INFO SignalUtils: Registering signal handler for HUP
24/08/26 07:09:17 INFO SignalUtils: Registering signal handler for INT
24/08/26 07:09:17 INFO SecurityManager: Changing view acls to: hadoop
24/08/26 07:09:17 INFO SecurityManager: Changing modify acls to: hadoop
24/08/26 07:09:17 INFO SecurityManager: Changing view acls groups to:
24/08/26 07:09:17 INFO SecurityManager: Changing modify acls groups to:
24/08/26 07:09:17 INFO SecurityManager: SecurityManager: authentication enabled; ui acls disabled; users with view permissions: hadoop; groups with view permissions: EMPTY; users with modify permissions: hadoop; groups with modify permissions: EMPTY
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1915)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:478)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:467)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.SparkThreadUtils$.awaitResult(SparkThreadUtils.scala:56)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:310)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:102)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$9(CoarseGrainedExecutorBackend.scala:498)
at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:985)
at scala.collection.immutable.Range.foreach(Range.scala:158)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:984)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:496)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
... 4 more
Caused by: java.io.IOException: Failed to connect to /10.0.176.163:36391
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:294)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:214)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:226)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.0.176.163:36391
Caused by: java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.Net.pollConnect(Native Method)
at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:840)
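The "Connection refused" at the bottom of that trace means the executor reached the host 10.0.176.163 but nothing was listening on port 36391 there (or a firewall/security group actively rejected the connection), which is different from a routing failure. A small probe like the sketch below, run from an executor host or task, can confirm whether the driver address in the log is actually reachable (the host/port arguments are placeholders):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. can_connect("10.0.176.163", 36391) from an executor host
```

A refusal here, combined with a correct route, would point at the driver binding its RPC endpoint to a different address than the one it advertises via spark.driver.host.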
@lsli8888,
1. Is the sampling rate of 0.0001 an XGBoost parameter? If so, the Spark tasks still need to read all the data and do the sampling inside XGBoost, and I suspect an OOM given your configuration. I would suggest limiting the input instead, e.g.:
train_df = train_df.limit(10)
2. Why did you set spark.driver.host to socket.gethostbyname(socket.gethostname())?
3. Can you make the sample code below work without changing any XGBoost code?
>>> from xgboost.spark import SparkXGBClassifier
>>> from pyspark.ml.linalg import Vectors
>>> df_train = spark.createDataFrame([
... (Vectors.dense(1.0, 2.0, 3.0), 0, False, 1.0),
... (Vectors.sparse(3, {1: 1.0, 2: 5.5}), 1, False, 2.0),
... (Vectors.dense(4.0, 5.0, 6.0), 0, True, 1.0),
... (Vectors.sparse(3, {1: 6.0, 2: 7.5}), 1, True, 2.0),
... ], ["features", "label", "isVal", "weight"])
>>> df_test = spark.createDataFrame([
... (Vectors.dense(1.0, 2.0, 3.0), ),
... ], ["features"])
>>> xgb_classifier = SparkXGBClassifier(max_depth=5, missing=0.0,
... validation_indicator_col='isVal', weight_col='weight',
... early_stopping_rounds=1, eval_metric='logloss')
>>> xgb_clf_model = xgb_classifier.fit(df_train)
>>> xgb_clf_model.transform(df_test).show()
@wbo4958,
I set the spark.driver.host to:
socket.gethostbyname(socket.gethostname())
on the suggestion of @sgonullu. This was mentioned earlier in the thread. Let me know if you have a different opinion on this. I did find setting spark.driver.host this way did get me further in the training albeit with the errors I showed above.
I also limited the training data using:
train_df = train_df.limit(10)
Unfortunately, I got the same results that I did before which is the "Caused by: java.io.IOException: Failed to connect to" error that I showed above.
I will give your sample code a try soon.
@wbo4958
I tried your code sample on EMR Serverless. Interestingly, it gave the exact same errors as I was seeing before, i.e. the "Caused by: java.io.IOException: Failed to connect to" error. The code did run locally, so there is something wrong with my EMR Serverless application that doesn't work with XGBoost. Again, my configuration works fine with the Spark GBTRegressor on EMR Serverless (although the performance and results are quite bad, which is why I'm hoping to use XGBoost).
@lsli8888, I see, so the Spark GBTRegressor can work? From the log, though, it seems something is wrong with the Spark cluster, since the executor couldn't connect to the driver. If that were the case, the Spark GBTRegressor shouldn't work either.
Hi @wbo4958 - let me clarify. Here are my test scenarios:
SparkXGBClassifier (your sample code, spark.driver.host set to socket.gethostbyname(socket.gethostname())): I get the connection failed error:
Caused by: java.io.IOException: Failed to connect to /x.x.x.x:xxxx
SparkXGBClassifier (your sample code, spark.driver.host not explicitly set):
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/spark/core.py", line 1054, in _train_booster
_rabit_args = _get_rabit_args(context, num_workers)
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/spark/utils.py", line 69, in _get_rabit_args
env = _start_tracker(context, n_workers)
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/spark/utils.py", line 58, in _start_tracker
tracker = RabitTracker(n_workers=n_workers, host_ip=host, sortby="task")
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/tracker.py", line 73, in __init__
_check_call(_LIB.XGTrackerCreate(args, ctypes.byref(handle)))
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/core.py", line 284, in _check_call
raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [05:47:20] /workspace/src/collective/result.cc:78:
- [socket.h:89|05:47:20]: Failed to bind socket. system error:Cannot assign requested address
Stack trace:
[bt] (0) /home/hadoop/environment/lib64/python3.9/site-packages/xgboost/lib/libxgboost.so(+0x22dbbc) [0x7fbf64686bbc]
[bt] (1) /home/hadoop/environment/lib64/python3.9/site-packages/xgboost/lib/libxgboost.so(+0x2ce6c1) [0x7fbf647276c1]
[bt] (2) /home/hadoop/environment/lib64/python3.9/site-packages/xgboost/lib/libxgboost.so(+0x2db3a9) [0x7fbf647343a9]
[bt] (3) /home/hadoop/environment/lib64/python3.9/site-packages/xgboost/lib/libxgboost.so(XGTrackerCreate+0x6b0) [0x7fbf646b8550]
[bt] (4) /lib64/libffi.so.8(+0x7d16) [0x7fbf88087d16]
[bt] (5) /lib64/libffi.so.8(+0x456e) [0x7fbf8808456e]
[bt] (6) /lib64/libffi.so.8(ffi_call+0x123) [0x7fbf88087373]
[bt] (7) /usr/lib64/python3.9/lib-dynload/_ctypes.cpython-39-x86_64-linux-gnu.so(+0x8e5d) [0x7fbf88094e5d]
[bt] (8) /usr/lib64/python3.9/lib-dynload/_ctypes.cpython-39-x86_64-linux-gnu.so(+0x8742) [0x7fbf88094742]
GbtClassifier (your sample code, spark.driver.host set to socket.gethostbyname(socket.gethostname())): I get the connection failed error:
Caused by: java.io.IOException: Failed to connect to /x.x.x.x:xxxx
GbtClassifier (slight modification of your sample code to use the GBT classifier instead, spark.driver.host not explicitly set):
In summary, setting spark.driver.host to socket.gethostbyname(socket.gethostname()) for SparkXGBClassifier seems to get things slightly further, but it still fails. Interestingly, setting spark.driver.host the same way for GbtClassifier seems to mess things up.
@lsli8888, I see. The issue is back to the original one: Cannot assign requested address.
Hmm, I would suggest applying https://github.com/dmlc/xgboost/pull/10281 to xgboost. That PR launches the RabitTracker on the driver side instead of the executor side, which may help.
After applying it, try the code below:
from pyspark.ml.linalg import Vectors
from xgboost.callback import EvaluationMonitor
from xgboost.spark import SparkXGBRegressor

df_train = spark.createDataFrame(
    [
        (Vectors.dense(1.0, 2.0, 3.0), 0, False, 1.0),
        (Vectors.sparse(3, {1: 1.0, 2: 5.5}), 1, False, 2.0),
        (Vectors.dense(4.0, 5.0, 6.0), 0, True, 1.0),
        (Vectors.sparse(3, {1: 6.0, 2: 7.5}), 1, True, 2.0),
    ]
    * 100,
    ["features", "label", "isVal", "weight"],
)
callbacks = EvaluationMonitor()
xgb_regressor = SparkXGBRegressor(
    num_workers=5,
    callbacks=[callbacks],
    tracker_on_driver=True,
    validation_indicator_col="isVal",
)
xgb_reg_model = xgb_regressor.fit(df_train)
@wbo4958
Hi, unfortunately, I was not able to build xgboost with the changes from the commit:
https://github.com/dmlc/xgboost/commit/6d24bbd74c8fa1ed184124ea1c2229442f4e0bb6
Of note, GitHub shows the warning: "This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository."
So I couldn't check out this commit hash directly.
I then tried checking out the parent commit:
https://github.com/dmlc/xgboost/commit/a5a58102e5e82fa508514c34cd8e5f408dcfd3e1
to which I then applied the changes from commit 6d24bbd. When I tried building via make, I got the error below:
In file included from /app/xgboost/src/c_api/c_api.cc:22:
/app/xgboost/src/c_api/../common/io.h:51:8: error: conflicting return type specified for 'virtual void xgboost::common::MemoryFixSizeBuffer::Write(const void*, std::size_t)'
51 | void Write(const void *ptr, std::size_t size) override {
| ^~~~~
In file included from /app/xgboost/dmlc-core/include/dmlc/data.h:14,
from /app/xgboost/include/xgboost/data.h:11,
from /app/xgboost/include/xgboost/predictor.h:12,
from /app/xgboost/src/c_api/../common/api_entry.h:10,
from /app/xgboost/src/c_api/c_api.cc:18:
/app/xgboost/dmlc-core/include/dmlc/./io.h:45:18: note: overridden function is 'virtual size_t dmlc::Stream::Write(const void*, size_t)'
45 | virtual size_t Write(const void* ptr, size_t size) = 0;
| ^~~~~
In file included from /app/xgboost/src/c_api/c_api.cc:22:
/app/xgboost/src/c_api/../common/io.h:94:8: error: conflicting return type specified for 'virtual void xgboost::common::MemoryBufferStream::Write(const void*, size_t)'
94 | void Write(const void *ptr, size_t size) override {
| ^~~~~
In file included from /app/xgboost/dmlc-core/include/dmlc/data.h:14,
from /app/xgboost/include/xgboost/data.h:11,
from /app/xgboost/include/xgboost/predictor.h:12,
from /app/xgboost/src/c_api/../common/api_entry.h:10,
from /app/xgboost/src/c_api/c_api.cc:18:
/app/xgboost/dmlc-core/include/dmlc/./io.h:45:18: note: overridden function is 'virtual size_t dmlc::Stream::Write(const void*, size_t)'
45 | virtual size_t Write(const void* ptr, size_t size) = 0;
| ^~~~~
In file included from /app/xgboost/src/c_api/c_api.cc:22:
/app/xgboost/src/c_api/../common/io.h:130:8: error: conflicting return type specified for 'virtual void xgboost::common::PeekableInStream::Write(const void*, size_t)'
130 | void Write(const void*, size_t) override {
| ^~~~~
In file included from /app/xgboost/dmlc-core/include/dmlc/data.h:14,
from /app/xgboost/include/xgboost/data.h:11,
from /app/xgboost/include/xgboost/predictor.h:12,
from /app/xgboost/src/c_api/../common/api_entry.h:10,
from /app/xgboost/src/c_api/c_api.cc:18:
/app/xgboost/dmlc-core/include/dmlc/./io.h:45:18: note: overridden function is 'virtual size_t dmlc::Stream::Write(const void*, size_t)'
45 | virtual size_t Write(const void* ptr, size_t size) = 0;
| ^~~~~
[ 19%] Building CXX object src/CMakeFiles/objxgboost.dir/collective/comm_group.cc.o
[ 20%] Building CXX object src/CMakeFiles/objxgboost.dir/collective/in_memory_handler.cc.o
[ 21%] Building CXX object src/CMakeFiles/objxgboost.dir/collective/loop.cc.o
gmake[2]: *** [src/CMakeFiles/objxgboost.dir/build.make:76: src/CMakeFiles/objxgboost.dir/c_api/c_api.cc.o] Error 1
gmake[2]: *** Waiting for unfinished jobs....
gmake[1]: *** [CMakeFiles/Makefile2:216: src/CMakeFiles/objxgboost.dir/all] Error 2
gmake: *** [Makefile:156: all] Error 2
I may be doing something wrong... so let me know!
@lsli8888, I just updated https://github.com/dmlc/xgboost/pull/10281 to the latest. You don't need to compile XGBoost; instead, you can just copy the changed Python files from https://github.com/dmlc/xgboost/pull/10281 over the installed xgboost Python files.
@wbo4958
I copied the changed Python files to the xgboost library in site-packages to replace the three XGBoost Python files (core.py, estimator.py, utils.py). Hopefully, I did this correctly!
Anyway, I then ran EMR Serverless with these changes and the updated SparkXgbRegressor code you suggested above. I initially ran the job with one Spark executor and 4 GB of RAM each for the driver and executor, 2 cores allocated for each task, and 4 driver and executor cores. I thought this would be more than enough resources, but the job failed with the following errors:
24/08/28 07:09:07 INFO DAGScheduler: ShuffleMapStage 0 (javaToPython at NativeMethodAccessorImpl.java:0) finished in 73.239 s
24/08/28 07:09:07 INFO DAGScheduler: looking for newly runnable stages
24/08/28 07:09:07 INFO DAGScheduler: running: Set()
24/08/28 07:09:07 INFO DAGScheduler: waiting: Set()
24/08/28 07:09:07 INFO DAGScheduler: failed: Set()
24/08/28 07:09:07 INFO SparkContext: Starting job: fit at /tmp/spark-61ee1a8b-2892-4143-83f9-b0d29080b620/train_model.py:430
24/08/28 07:09:07 WARN DAGScheduler: Barrier stage in job 1 requires 5 slots, but only 2 are available. Will retry up to 40 more times
24/08/28 07:09:22 WARN DAGScheduler: Barrier stage in job 1 requires 5 slots, but only 2 are available. Will retry up to 39 more times
...
24/08/28 07:11:02 INFO EmrServerlessClusterSchedulerBackend$EmrServerlessDriverEndpoint: No executor found for 2600:1f16:3b4:4e02:360f:ee62:383a:b8c1:45484
24/08/28 07:11:07 WARN DAGScheduler: Barrier stage in job 1 requires 5 slots, but only 2 are available. Will retry up to 32 more times
...
24/08/28 07:17:52 WARN DAGScheduler: Barrier stage in job 1 requires 5 slots, but only 2 are available. Will retry up to 5 more times
24/08/28 07:18:07 WARN DAGScheduler: Barrier stage in job 1 requires 5 slots, but only 2 are available. Will retry up to 4 more times
24/08/28 07:18:22 WARN DAGScheduler: Barrier stage in job 1 requires 5 slots, but only 2 are available. Will retry up to 3 more times
24/08/28 07:18:37 WARN DAGScheduler: Barrier stage in job 1 requires 5 slots, but only 2 are available. Will retry up to 2 more times
24/08/28 07:18:52 WARN DAGScheduler: Barrier stage in job 1 requires 5 slots, but only 2 are available. Will retry up to 1 more times
24/08/28 07:19:07 WARN DAGScheduler: Barrier stage in job 1 requires 5 slots, but only 2 are available. Will retry up to 0 more times
24/08/28 07:19:07 INFO DAGScheduler: Job 1 failed: fit at /tmp/spark-61ee1a8b-2892-4143-83f9-b0d29080b620/train_model.py:430, took 600.047621 s
24/08/28 07:19:08 INFO SparkContext: Invoking stop() from shutdown hook
24/08/28 07:19:08 INFO SparkContext: SparkContext is stopping with exitCode 0.
24/08/28 07:19:08 INFO SparkUI: Stopped Spark web UI at http://[2600:1f16:3b4:4e02:bd93:7f0d:b127:4719]:4040
24/08/28 07:19:08 INFO EmrServerlessClusterSchedulerBackend: Shutting down all executors
24/08/28 07:19:08 INFO EmrServerlessClusterSchedulerBackend$EmrServerlessDriverEndpoint: Asking each executor to shut down
24/08/28 07:19:08 INFO TimeBasedRotatingEventLogFilesWriter: Renaming file:/var/log/spark/apps/eventlog_v2_00flvmvbrhnrc80f/00flvmvbrhnrc80f.inprogress to file:/var/log/spark/apps/eventlog_v2_00flvmvbrhnrc80f/events_1_00flvmvbrhnrc80f
24/08/28 07:19:08 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
24/08/28 07:19:08 INFO MemoryStore: MemoryStore cleared
24/08/28 07:19:08 INFO BlockManager: BlockManager stopped
24/08/28 07:19:08 INFO BlockManagerMaster: BlockManagerMaster stopped
24/08/28 07:19:08 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
24/08/28 07:19:10 INFO SparkContext: Successfully stopped SparkContext
24/08/28 07:19:10 INFO ShutdownHookManager: Shutdown hook called
I then ran the job with one Spark executor and 8 GB of RAM each for the driver and executor, 2 cores allocated for each task, and 8 driver and executor cores. It failed again with a similar error:
...
Barrier stage in job 1 requires 5 slots, but only 4 are available. Will retry up to 5 more times
...
With this info, I then upped the resources again to one Spark executor and 16 GB of RAM each for the driver and executor, 2 cores allocated for each task, and 16 driver and executor cores. I then got the following error:
24/08/28 08:18:25 INFO BarrierCoordinator: Current barrier epoch for Stage 2 (Attempt 0) is 0.
24/08/28 08:18:25 INFO BarrierCoordinator: Barrier sync epoch 0 from Stage 2 (Attempt 0) received update from Task 17, current progress: 1/5.
24/08/28 08:18:25 INFO BarrierCoordinator: Current barrier epoch for Stage 2 (Attempt 0) is 0.
24/08/28 08:18:25 INFO BarrierCoordinator: Barrier sync epoch 0 from Stage 2 (Attempt 0) received update from Task 19, current progress: 2/5.
24/08/28 08:18:25 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 20) ([2600:1f16:3b4:4e00:7237:adba:2141:134c] executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/spark/core.py", line 1054, in _train_booster
_rabit_args = _get_rabit_args(context, num_workers)
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/spark/utils.py", line 69, in _get_rabit_args
env = _start_tracker(context, n_workers)
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/spark/utils.py", line 58, in _start_tracker
tracker = RabitTracker(n_workers=n_workers, host_ip=host, sortby="task")
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/tracker.py", line 73, in __init__
_check_call(_LIB.XGTrackerCreate(args, ctypes.byref(handle)))
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/core.py", line 284, in _check_call
raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [08:18:25] /workspace/src/collective/result.cc:78:
- [socket.h:89|08:18:25]: Failed to bind socket. system error:Cannot assign requested address
Stack trace:
[bt] (0) /home/hadoop/environment/lib64/python3.9/site-packages/xgboost/lib/libxgboost.so(+0x22dbbc) [0x7f2022d05bbc]
[bt] (1) /home/hadoop/environment/lib64/python3.9/site-packages/xgboost/lib/libxgboost.so(+0x2ce6c1) [0x7f2022da66c1]
[bt] (2) /home/hadoop/environment/lib64/python3.9/site-packages/xgboost/lib/libxgboost.so(+0x2db3a9) [0x7f2022db33a9]
[bt] (3) /home/hadoop/environment/lib64/python3.9/site-packages/xgboost/lib/libxgboost.so(XGTrackerCreate+0x6b0) [0x7f2022d37550]
[bt] (4) /lib64/libffi.so.8(+0x7d16) [0x7f204668bd16]
[bt] (5) /lib64/libffi.so.8(+0x456e) [0x7f204668856e]
[bt] (6) /lib64/libffi.so.8(ffi_call+0x123) [0x7f204668b373]
[bt] (7) /usr/lib64/python3.9/lib-dynload/_ctypes.cpython-39-x86_64-linux-gnu.so(+0x8e5d) [0x7f2046698e5d]
[bt] (8) /usr/lib64/python3.9/lib-dynload/_ctypes.cpython-39-x86_64-linux-gnu.so(+0x8742) [0x7f2046698742]
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:572)
at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:118)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:525)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.ContextAwareIterator.hasNext(ContextAwareIterator.scala:39)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:86)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:80)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:322)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:751)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:451)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1962)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:282)
24/08/28 08:18:25 INFO DAGScheduler: Marking ResultStage 2 (fit at /tmp/spark-5efa145b-ef07-496a-8825-3d2e2f37acb7/train_model.py:430) as failed due to a barrier task failed.
24/08/28 08:18:25 INFO TaskSchedulerImpl: Killing all running tasks in stage 2: Task ResultTask(2, 0) from barrier stage ResultStage 2 (fit at /tmp/spark-5efa145b-ef07-496a-8825-3d2e2f37acb7/train_model.py:430) failed.
24/08/28 08:18:25 INFO DAGScheduler: ResultStage 2 (fit at /tmp/spark-5efa145b-ef07-496a-8825-3d2e2f37acb7/train_model.py:430) failed in 25.776 s due to Stage failed because barrier task ResultTask(2, 0) finished unsuccessfully.
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/spark/core.py", line 1054, in _train_booster
_rabit_args = _get_rabit_args(context, num_workers)
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/spark/utils.py", line 69, in _get_rabit_args
env = _start_tracker(context, n_workers)
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/spark/utils.py", line 58, in _start_tracker
tracker = RabitTracker(n_workers=n_workers, host_ip=host, sortby="task")
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/tracker.py", line 73, in __init__
_check_call(_LIB.XGTrackerCreate(args, ctypes.byref(handle)))
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/core.py", line 284, in _check_call
raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [08:18:25] /workspace/src/collective/result.cc:78:
- [socket.h:89|08:18:25]: Failed to bind socket. system error:Cannot assign requested address
Stack trace:
[bt] (0) /home/hadoop/environment/lib64/python3.9/site-packages/xgboost/lib/libxgboost.so(+0x22dbbc) [0x7f2022d05bbc]
[bt] (1) /home/hadoop/environment/lib64/python3.9/site-packages/xgboost/lib/libxgboost.so(+0x2ce6c1) [0x7f2022da66c1]
[bt] (2) /home/hadoop/environment/lib64/python3.9/site-packages/xgboost/lib/libxgboost.so(+0x2db3a9) [0x7f2022db33a9]
[bt] (3) /home/hadoop/environment/lib64/python3.9/site-packages/xgboost/lib/libxgboost.so(XGTrackerCreate+0x6b0) [0x7f2022d37550]
[bt] (4) /lib64/libffi.so.8(+0x7d16) [0x7f204668bd16]
[bt] (5) /lib64/libffi.so.8(+0x456e) [0x7f204668856e]
[bt] (6) /lib64/libffi.so.8(ffi_call+0x123) [0x7f204668b373]
[bt] (7) /usr/lib64/python3.9/lib-dynload/_ctypes.cpython-39-x86_64-linux-gnu.so(+0x8e5d) [0x7f2046698e5d]
[bt] (8) /usr/lib64/python3.9/lib-dynload/_ctypes.cpython-39-x86_64-linux-gnu.so(+0x8742) [0x7f2046698742]
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:572)
at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:118)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:525)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.ContextAwareIterator.hasNext(ContextAwareIterator.scala:39)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:86)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:80)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:322)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:751)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:451)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1962)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:282)
24/08/28 08:18:25 INFO DAGScheduler: Resubmitting ResultStage 2 (fit at /tmp/spark-5efa145b-ef07-496a-8825-3d2e2f37acb7/train_model.py:430) due to barrier stage failure.
So ultimately, the job still failed. However, I hope this extra info helps debug this issue. Thanks!
@wbo4958
Here's the standard output for the configuration with one Spark executor, 16 GB of RAM each for the driver and executor, 2 cores allocated per task, and 16 driver and executor cores. Do I need to allocate even more resources when running the Spark job? If so, that wouldn't seem to make sense, as the script is so simple and the initial dataframe so small...
2024-08-28 07:33:38.492 [WARNING] SparkXGBRegressor: The num_workers 5 set for xgboost distributed training is greater than current max number of concurrent spark task slots, you need wait until more task slots available or you need increase spark cluster workers.
2024-08-28 07:33:39.327 [INFO] XGBoost-PySpark: Running xgboost-2.1.1 on 5 workers with
booster params: {'objective': 'reg:squarederror', 'device': 'cpu', 'tracker_on_driver': True, 'nthread': 2}
train_call_kwargs_params: {'callbacks': [<xgboost.callback.EvaluationMonitor object at 0x7f539ec9dc40>], 'verbose_eval': True, 'num_boost_round': 100}
dmatrix_kwargs: {'nthread': 2, 'missing': nan}
Traceback (most recent call last):
File "/tmp/spark-b02a269f-dd3e-4b60-97ca-a68e9758c344/train_model.py", line 511, in <module>
main()
File "/home/hadoop/environment/lib64/python3.9/site-packages/my_utils/utils/logger_utils.py", line 43, in wrapper
return func(*args, **kwargs)
File "/home/hadoop/environment/lib64/python3.9/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/home/hadoop/environment/lib64/python3.9/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/hadoop/environment/lib64/python3.9/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/hadoop/environment/lib64/python3.9/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/tmp/spark-b02a269f-dd3e-4b60-97ca-a68e9758c344/train_model.py", line 365, in main
train_model(
File "/home/hadoop/environment/lib64/python3.9/site-packages/my_utils/utils/logger_utils.py", line 60, in timed
result: Callable = method(*args, **kw)
File "/home/hadoop/environment/lib64/python3.9/site-packages/my_utils/utils/logger_utils.py", line 43, in wrapper
return func(*args, **kwargs)
File "/tmp/spark-b02a269f-dd3e-4b60-97ca-a68e9758c344/train_model.py", line 430, in train_model
xgb_reg_model = xgb_regressor.fit(df_train)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 205, in fit
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/spark/core.py", line 1125, in _fit
(config, booster) = _run_job()
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/spark/core.py", line 1111, in _run_job
ret = rdd_with_resource.collect()[0]
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1833, in collect
File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.scheduler.BarrierJobSlotsNumberCheckFailed: [SPARK-24819]: Barrier execution mode does not allow run a barrier stage that requires more slots than the total number of slots in the cluster currently. Please init a new cluster with more resources(e.g. CPU, GPU) or repartition the input RDD(s) to reduce the number of slots required to run this barrier stage.
at org.apache.spark.errors.SparkCoreErrors$.numPartitionsGreaterThanMaxNumConcurrentTasksError(SparkCoreErrors.scala:235)
at org.apache.spark.scheduler.DAGScheduler.checkBarrierStageWithNumSlots(DAGScheduler.scala:609)
at org.apache.spark.scheduler.DAGScheduler.createResultStage(DAGScheduler.scala:687)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1355)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3214)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3205)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3194)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1041)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2406)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2427)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2446)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2471)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1046)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:407)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1045)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:195)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:840)
Also, looking at the Spark executor logs, the JVM resources seem fine:
[703.318s][info ][gc,heap,exit ] Heap
[703.318s][info ][gc,heap,exit ] PSYoungGen total 2446848K, used 793613K [0x0000000755580000, 0x0000000800000000, 0x0000000800000000)
[703.318s][info ][gc,heap,exit ] eden space 2097664K, 37% used [0x0000000755580000,0x0000000785c837b0,0x00000007d5600000)
[703.318s][info ][gc,heap,exit ] from space 349184K, 0% used [0x00000007eab00000,0x00000007eab00000,0x0000000800000000)
[703.318s][info ][gc,heap,exit ] to space 349184K, 0% used [0x00000007d5600000,0x00000007d5600000,0x00000007eab00000)
[703.318s][info ][gc,heap,exit ] ParOldGen total 5592576K, used 104307K [0x0000000600000000, 0x0000000755580000, 0x0000000755580000)
[703.318s][info ][gc,heap,exit ] object space 5592576K, 1% used [0x0000000600000000,0x00000006065dccb8,0x0000000755580000)
[703.318s][info ][gc,heap,exit ] Metaspace used 91931K, committed 92992K, reserved 1179648K
[703.318s][info ][gc,heap,exit ] class space used 12356K, committed 12864K, reserved 1048576K
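For what it's worth, the barrier-slot warnings line up with the configurations I tried. A rough model (my own sketch, assuming slots = executors × (executor cores ÷ CPUs per task)):

```python
def barrier_slots(executor_instances: int, executor_cores: int, task_cpus: int) -> int:
    """Concurrent task slots available to a barrier stage (rough model)."""
    return executor_instances * (executor_cores // task_cpus)

# The configurations tried above, all with num_workers=5:
# 1 executor, 4 cores,  2 cpus/task -> 2 slots (fails: needs 5)
# 1 executor, 8 cores,  2 cpus/task -> 4 slots (fails: needs 5)
# 1 executor, 16 cores, 2 cpus/task -> 8 slots (enough slots)
```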
Hi @lsli8888
24/08/28 08:18:25 INFO DAGScheduler: Marking ResultStage 2 (fit at /tmp/spark-5efa145b-ef07-496a-8825-3d2e2f37acb7/train_model.py:430) as failed due to a barrier task failed.
24/08/28 08:18:25 INFO TaskSchedulerImpl: Killing all running tasks in stage 2: Task ResultTask(2, 0) from barrier stage ResultStage 2 (fit at /tmp/spark-5efa145b-ef07-496a-8825-3d2e2f37acb7/train_model.py:430) failed.
24/08/28 08:18:25 INFO DAGScheduler: ResultStage 2 (fit at /tmp/spark-5efa145b-ef07-496a-8825-3d2e2f37acb7/train_model.py:430) failed in 25.776 s due to Stage failed because barrier task ResultTask(2, 0) finished unsuccessfully.
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/spark/core.py", line 1054, in _train_booster
_rabit_args = _get_rabit_args(context, num_workers)
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/spark/utils.py", line 69, in _get_rabit_args
env = _start_tracker(context, n_workers)
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/spark/utils.py", line 58, in _start_tracker
tracker = RabitTracker(n_workers=n_workers, host_ip=host, sortby="task")
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/tracker.py", line 73, in __init__
_check_call(_LIB.XGTrackerCreate(args, ctypes.byref(handle)))
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/core.py", line 284, in _check_call
raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [08:18:25] /workspace/src/collective/result.cc:78:
- [socket.h:89|08:18:25]: Failed to bind socket. system error:Cannot assign requested address
It looks like the RabitTracker was started on the executor side instead of the driver side, judging from the frame File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/spark/core.py", line 1054, in _train_booster.
The correct code for starting the RabitTracker should be at https://github.com/dmlc/xgboost/pull/10281/files#diff-8900fd245992de81e0a7b8cb7fcd18c37bae4473e8672e9b11fd7010cac358cfR1027-R1035
Could you please double-check that you have really successfully replaced core.py/estimator.py/utils.py under xgboost/pyspark?
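One quick sanity check (a hypothetical helper, not part of xgboost) is to print which file Python actually imports for each module:

```python
import importlib.util

def module_file(name: str) -> str:
    """Return the path of the file backing an installed module."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(name)
    return spec.origin

# On the cluster image, e.g.:
# print(module_file("xgboost.spark.core"))  # should point at the patched copy
```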
Hi @wbo4958. Yep, I definitely messed up my Dockerfile and the three files were not copied into the site-packages directory. I'm sorry about that mistake! However, I have fixed the issue now.
I tried again for a configuration consisting of a single Spark executor and 16 GB of RAM each for the driver and executor, 2 cores allocated for each task, and 16 driver and executor cores. The EMR Serverless job once again failed, albeit with different errors this time.
Here's a snippet of the standard output from the Spark driver log:
Traceback (most recent call last):
File "/tmp/spark-2df01013-276b-46b4-ae1b-60ee4cf5eb88/train_model.py", line 511, in <module>
main()
File "/home/hadoop/environment/lib64/python3.9/site-packages/my_utils/utils/logger_utils.py", line 43, in wrapper
return func(*args, **kwargs)
File "/home/hadoop/environment/lib64/python3.9/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/home/hadoop/environment/lib64/python3.9/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/hadoop/environment/lib64/python3.9/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/hadoop/environment/lib64/python3.9/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/tmp/spark-2df01013-276b-46b4-ae1b-60ee4cf5eb88/train_model.py", line 365, in main
train_model(
File "/home/hadoop/environment/lib64/python3.9/site-packages/my_utils/utils/logger_utils.py", line 60, in timed
result: Callable = method(*args, **kw)
File "/home/hadoop/environment/lib64/python3.9/site-packages/my_utils/utils/logger_utils.py", line 43, in wrapper
return func(*args, **kwargs)
File "/tmp/spark-2df01013-276b-46b4-ae1b-60ee4cf5eb88/train_model.py", line 430, in train_model
xgb_reg_model = xgb_regressor.fit(df_train)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 205, in fit
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/spark/core.py", line 1035, in _fit
rabit_args = _get_rabit_args(driver_host, num_workers)
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/spark/utils.py", line 68, in _get_rabit_args
env = _start_tracker(host, n_workers)
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/spark/utils.py", line 57, in _start_tracker
tracker = RabitTracker(n_workers=n_workers, host_ip=host, sortby="task")
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/tracker.py", line 64, in __init__
get_family(host_ip) # use python socket to stop early for invalid address
File "/home/hadoop/environment/lib64/python3.9/site-packages/xgboost/tracker.py", line 14, in get_family
return socket.getaddrinfo(addr, None)[0][0]
File "/usr/lib64/python3.9/socket.py", line 954, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
2024-08-29 01:06:15.809 [INFO] py4j.clientserver: Closing down clientserver connection
Here's a snippet of the standard error from the Spark driver log:
24/08/29 01:06:15 ERROR TransportRequestHandler: Error sending result StreamResponse[streamId=/files/169448f8c7f5d18bef0895058b2ef6b518055c10d02a26e665865d7f4e7cb9c1.zip,byteCount=2263051466,body=FileSegmentManagedBuffer[file=/tmp/spark-2df01013-276b-46b4-ae1b-60ee4cf5eb88/169448f8c7f5d18bef0895058b2ef6b518055c10d02a26e665865d7f4e7cb9c1.zip,offset=0,length=2263051466]] to /[2600:1f16:ab4:a00:af9a:a730:9175:613d]:39094; closing connection
io.netty.channel.StacklessClosedChannelException: null
at io.netty.channel.AbstractChannel.close(ChannelPromise)(Unknown Source) ~[netty-transport-4.1.100.Final.jar:4.1.100.Final]
24/08/29 01:06:17 INFO SparkContext: Successfully stopped SparkContext
Here's a snippet of the standard error from the Spark executor log:
24/08/29 01:06:14 INFO Executor: Fetching spark://[2600:1f16:ab4:a00:18cb:4e30:7054:cf7a]:38681/files/169448f8c7f5d18bef0895058b2ef6b518055c10d02a26e665865d7f4e7cb9c1.zip#environment with timestamp 1724893537240
24/08/29 01:06:14 INFO Utils: Fetching spark://[2600:1f16:ab4:a00:18cb:4e30:7054:cf7a]:38681/files/169448f8c7f5d18bef0895058b2ef6b518055c10d02a26e665865d7f4e7cb9c1.zip to /tmp/spark-5e0ba5d2-f802-49e7-ac39-5466390b7564/fetchFileTemp7332881780438855624.tmp
24/08/29 01:06:24 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Unable to create executor due to null
java.nio.channels.ClosedChannelException: null
at org.apache.spark.network.client.StreamInterceptor.channelInactive(StreamInterceptor.java:62) ~[spark-network-common_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:223) ~[spark-network-common_2.12-3.5.0-amzn-1.jar:3.5.0-amzn-1]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305) ~[netty-transport-4.1.100.Final.jar:4.1.100.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281) ~[netty-transport-4.1.100.Final.jar:4.1.100.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274) ~[netty-transport-4.1.100.Final.jar:4.1.100.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405) ~[netty-transport-4.1.100.Final.jar:4.1.100.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:301) ~[netty-transport-4.1.100.Final.jar:4.1.100.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281) ~[netty-transport-4.1.100.Final.jar:4.1.100.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901) ~[netty-transport-4.1.100.Final.jar:4.1.100.Final]
at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:813) ~[netty-transport-4.1.100.Final.jar:4.1.100.Final]
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173) ~[netty-common-4.1.100.Final.jar:4.1.100.Final]
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166) ~[netty-common-4.1.100.Final.jar:4.1.100.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) ~[netty-common-4.1.100.Final.jar:4.1.100.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:566) ~[netty-transport-4.1.100.Final.jar:4.1.100.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.100.Final.jar:4.1.100.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.100.Final.jar:4.1.100.Final]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.100.Final.jar:4.1.100.Final]
at java.lang.Thread.run(Thread.java:840) [?:?]
24/08/29 01:06:24 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
24/08/29 01:06:24 INFO CoarseGrainedExecutorBackend: Driver from 2600:1f16:ab4:a00:18cb:4e30:7054:cf7a:38681 disconnected during shutdown
24/08/29 01:06:24 INFO CoarseGrainedExecutorBackend: Driver from [2600:1f16:ab4:a00:18cb:4e30:7054:cf7a]:38681 disconnected during shutdown
Hi @trivialfis, I don't know where it went wrong, could you help take a look at it?
socket.gaierror: [Errno -2] Name or service not known
This error is raised by the Python socket module to signal an invalid address. We will have to spin up an EMR cluster to debug its network configuration. Maybe it's a firewall, maybe there are some quirks on EMR, I don't know. It's safe to assume that Spark is returning an address that cannot be used by a normal TCP socket.
I have seen errors caused by invalid DNS configuration on another platform; networking can be tricky.
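To help narrow this down, here is a small sketch one could run on a driver/executor to check whether the host address Spark hands back actually resolves through the normal TCP stack. The helper name `can_resolve` is made up for illustration; the `getaddrinfo` call is roughly what the tracker does before binding, and it is the call that raises `socket.gaierror: [Errno -2] Name or service not known` when resolution fails:

```python
import socket


def can_resolve(host: str) -> bool:
    """Return True if `host` resolves to a usable TCP address.

    A failing lookup raises socket.gaierror, which is the error
    seen in this issue. `host` would come from the Spark barrier
    task context in the real setup; here it is just a parameter.
    """
    try:
        socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
        return True
    except socket.gaierror:
        return False


if __name__ == "__main__":
    # "localhost" should resolve everywhere; the reserved
    # .invalid TLD should never resolve (RFC 2606).
    print(can_resolve("localhost"))
    print(can_resolve("no-such-host.invalid"))
```

If this returns `False` for the address Spark reports on an executor, the problem is the cluster's DNS/network configuration rather than XGBoost itself.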
Thanks @trivialfis! FYI, I'm using EMR Serverless rather than the regular EMR. Not sure if that makes too much of a difference, but just wanted to let you know.
Environment
AWS EMR Serverless 7.0.0 PySpark 3.5.0 XGBoost 2.0.3
I’m using XGBoost for regression, specifically the SparkXGBRegressor. I’m able to use it without issues on my local machine. However, I get the following Rabit-related error when executing it on AWS EMR Serverless. Any ideas? Is it related to this?: