dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Support for Spark version >= 3.1 and AssertionError query #7310

Open monicasenapati opened 3 years ago

monicasenapati commented 3 years ago

Hi! I wanted to confirm whether XGBoost supports Spark version 3.1.2. I have been trying to run XGBoost on the latest version of Apache Spark with a dataset of more than 3 TB on a 28-node cluster.

Also, I have been getting the following error and haven't been able to figure out what might be causing this.

21/10/10 10:50:17 INFO XGBoostSpark: starting training with timeout set as 1800000 ms for waiting for resources
21/10/10 13:44:27 INFO RabitTracker$TrackerProcessLogger: Exception in thread Thread-1:
21/10/10 13:44:27 INFO RabitTracker$TrackerProcessLogger: Traceback (most recent call last):
21/10/10 13:44:27 INFO RabitTracker$TrackerProcessLogger:   File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
21/10/10 13:44:27 INFO RabitTracker$TrackerProcessLogger:     self.run()
21/10/10 13:44:27 INFO RabitTracker$TrackerProcessLogger:   File "/usr/lib/python2.7/threading.py", line 754, in run
21/10/10 13:44:27 INFO RabitTracker$TrackerProcessLogger:     self.__target(*self.__args, **self.__kwargs)
21/10/10 13:44:27 INFO RabitTracker$TrackerProcessLogger:   File "/tmp/tracker2488390078882395804.py", line 324, in run
21/10/10 13:44:27 INFO RabitTracker$TrackerProcessLogger:     self.accept_slaves(nslave)
21/10/10 13:44:27 INFO RabitTracker$TrackerProcessLogger:   File "/tmp/tracker2488390078882395804.py", line 268, in accept_slaves
21/10/10 13:44:27 INFO RabitTracker$TrackerProcessLogger:     s = SlaveEntry(fd, s_addr)
21/10/10 13:44:27 INFO RabitTracker$TrackerProcessLogger:   File "/tmp/tracker2488390078882395804.py", line 64, in __init__
21/10/10 13:44:27 INFO RabitTracker$TrackerProcessLogger:     assert magic == kMagic, 'invalid magic number=%d from %s' % (magic, self.host)
21/10/10 13:44:27 INFO RabitTracker$TrackerProcessLogger: AssertionError: invalid magic number=542393671 from 143.198.38.116
21/10/10 13:44:27 INFO RabitTracker$TrackerProcessLogger: 
21/10/10 13:44:27 INFO RabitTracker$TrackerProcessLogger: Tracker Process ends with exit code 0

I look forward to hearing back from you. Thank you in advance for your help and time.

trivialfis commented 3 years ago

Why are you running Python 2? Just out of curiosity, I haven't seen it in a while now... It has reached its end of life, and I don't think the latest XGBoost tracker can work with it.

trivialfis commented 3 years ago

For the error, it would be great if you could provide a reproducible example that I can run.

monicasenapati commented 3 years ago

Hi @trivialfis, I am not actually using Python directly. The tools I am working with are Spark, Hadoop, and Scala. Python 2.7 is what the cluster I am using comes with by default.

monicasenapati commented 3 years ago

For the error, what kind of reproducer would you prefer? This setup runs on a CloudLab cluster under our assigned project, and the files are very large, totaling more than 3 TB of data stored in HDFS.

trivialfis commented 3 years ago

The tools I am working with are Spark, Hadoop, and Scala. Python 2.7 is what the cluster I am using comes with by default.

Yes, but Python is part of the dependency chain, so it needs some care to make sure the correct Python interpreter is being used.
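
As a quick sanity check (a sketch, not part of the XGBoost API; it only assumes the tracker script is launched with whatever python resolves to on the driver's PATH), you can print which interpreter that is from a spark-shell or Scala REPL on the driver node:

import scala.sys.process._

// Hypothetical check: see which binary the plain `python` command resolves to
// and which version it reports, since that is assumed to be the interpreter
// the Rabit tracker script ends up running under.
val pythonPath = "which python".!!.trim
val pythonVersion = Seq("python", "-c", "import sys; print(sys.version)").!!.trim
println(s"python resolves to $pythonPath, version $pythonVersion")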

For the error, what kind of reproducer would you prefer? This setup runs on a CloudLab cluster under our assigned project, and the files are very large, totaling more than 3 TB of data stored in HDFS.

It would be great if you could debug it a little first: choose the right Python environment and reduce the amount of data needed to reproduce the error (see https://en.wikipedia.org/wiki/Minimal_working_example). It's unlikely anyone here will try to reproduce your environment and run 3 TB of data to help you debug the issue.
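
For reference, here is a minimal sketch of what such a reduced example could look like with XGBoost4J-Spark; the path, the assumption of a binary column named "label", and the parameter values are placeholders, not taken from your setup:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

val spark = SparkSession.builder().appName("xgb-mwe").getOrCreate()

// Read a small sample of the data: enough rows to trigger the failure, nowhere near 3 TB.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs://vm0:9000/path/to/small_sample.csv")   // placeholder path
  .sample(withReplacement = false, fraction = 0.001, seed = 42)

// Assemble every non-label column into a single feature vector.
val featureCols = df.columns.filter(_ != "label")
val assembled = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
  .transform(df)

// Keep the training job itself as small as possible for the reproducer.
val xgb = new XGBoostClassifier(Map(
  "objective"   -> "binary:logistic",
  "num_round"   -> 10,
  "num_workers" -> 2
)).setFeaturesCol("features").setLabelCol("label")

val model = xgb.fit(assembled)

If the magic-number assertion still shows up at that scale, the reproducer is small enough to share here.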

monicasenapati commented 3 years ago

Yes, but Python is part of the dependency chain, so it needs some care to make sure the correct Python interpreter is being used.

I see. Let me try that and get back to you. Just a quick background: I recently switched to Spark 3. I was running the same code on a slightly smaller dataset with Python 2.7 and Spark 2.3 without issues. Since the OS on the cluster got upgraded, I had to switch to these versions.

monicasenapati commented 3 years ago

Could you please let me know which version of Python the tracker is compatible with?

trivialfis commented 3 years ago

Python 3.6.

Also, please consider checking the network environment, etc. The inconsistent magic number is usually caused by some sudden error in one of the workers.
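
One thing worth ruling out on the networking side: 542393671 is 0x20544547, which is the ASCII bytes "GET " read as a little-endian 32-bit integer, so the connection that tripped the assertion looks like an HTTP request (for example from an internet scanner) rather than an XGBoost worker, and the reported source addresses do not look like your cluster nodes. Below is a rough sketch of a reachability probe (host and port are placeholders; use the address the tracker logs with "start listen on host:port"). Point it at a throwaway listener started on that port on the tracker host rather than at a running job, since connecting to a live tracker without the expected handshake would itself trip the same assertion. Run it from a Scala REPL or spark-shell on each worker node:

import java.net.{InetSocketAddress, Socket}

// Placeholder values: substitute the host/port the tracker prints in the driver log.
val trackerHost = "130.127.133.143"
val trackerPort = 9091

// Try to open a TCP connection with a 5-second timeout.
val socket = new Socket()
try {
  socket.connect(new InetSocketAddress(trackerHost, trackerPort), 5000)
  println(s"this node can reach $trackerHost:$trackerPort")
} catch {
  case e: Exception => println(s"cannot reach $trackerHost:$trackerPort -> ${e.getMessage}")
} finally {
  socket.close()
}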

monicasenapati commented 3 years ago

Thanks for the confirmation @trivialfis. I updated Python on the nodes to 3.6, yet I continue to get the following error:

21/10/12 09:26:35 INFO RabitTracker$TrackerProcessLogger: 2021-10-12 11:26:35,965 INFO start listen on 130.127.133.143:9091
21/10/12 09:26:35 INFO XGBoostSpark: starting training with timeout set as 1800000 ms for waiting for resources
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger: Exception in thread Thread-1:
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger: Traceback (most recent call last):
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger:   File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger:     self.run()
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger:   File "/usr/lib/python3.6/threading.py", line 864, in run
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger:     self._target(*self._args, **self._kwargs)
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger:   File "/tmp/tracker11295180445699876609.py", line 324, in run
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger:     self.accept_slaves(nslave)
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger:   File "/tmp/tracker11295180445699876609.py", line 268, in accept_slaves
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger:     s = SlaveEntry(fd, s_addr)
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger:   File "/tmp/tracker11295180445699876609.py", line 64, in __init__
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger:     assert magic == kMagic, 'invalid magic number=%d from %s' % (magic, self.host)
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger: AssertionError: invalid magic number=542393671 from 185.173.35.17
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger:
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger: Tracker Process ends with exit code 0
21/10/12 17:48:45 INFO RabitTracker: Tracker Process ends with exit code 0
21/10/12 17:48:45 INFO XGBoostSpark: Rabit returns with exit code 0

I have also checked across the nodes that all the Spark workers are connected and running.

monicasenapati commented 3 years ago

Further checking the Spark worker logs, I found errors like the following:

21/10/12 08:45:34 INFO FileScanRDD: Reading File path: hdfs://vm0:9000/6mohe/containsLink/containsLink_ohe_4.csv, range: 31272730624-31406948352, partition values: [empty row]
21/10/12 08:46:35 WARN BlockReaderFactory: I/O error constructing remote block reader.
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.10.1.24:50010]
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
        at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3441)
        at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:777)
        at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:694)
        at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:355)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:665)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:874)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:926)
        at java.base/java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.fillBuffer(UncompressedSplitLineReader.java:62)
        at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
        at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:94)
        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:116)
        at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.<init>(HadoopFileLinesReader.scala:65)

trivialfis commented 3 years ago

Seems it's failing at Hadoop?

monicasenapati commented 3 years ago

All the Hadoop nodes are healthy, and so are the files stored on them.

monicasenapati commented 3 years ago

I was initially running on a 16-node cluster, and it worked well for two datasets. The current dataset is a scaled-up version of the first two and hence larger, so I moved to a 32-node cluster. I have been running Spark in standalone mode on the 32-node cluster, and I tried running through YARN too; the error persists. Each node in the cluster has 188 GB of memory and 40 cores (as I can see from the htop output). I was using the following configuration for spark-submit: --master yarn --conf spark.driver.memory=120G --num-executors 31 --executor-cores 2 --executor-memory 80G, with num_workers -> 62 in the params map of XGBoost. Could this be an issue? Or could it be modified to make better use of the resources?
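
For what it's worth, here is a sketch of how those numbers usually relate to each other in the XGBoost4J-Spark params map (the figures below are just the ones reported above, restated as code, not a tuning recommendation; num_round and objective are placeholders):

// 31 executors x 2 cores gives 62 concurrent task slots, which matches num_workers = 62.
val numExecutors   = 31
val executorCores  = 2
val totalTaskSlots = numExecutors * executorCores   // 62

val xgbParams = Map(
  "num_workers" -> totalTaskSlots,    // one XGBoost worker per concurrent Spark task
  "nthread"     -> executorCores,     // threads per worker, usually <= executor cores
  "num_round"   -> 100,               // placeholder
  "objective"   -> "binary:logistic"  // placeholder
)

With that accounting, the worker count itself looks consistent with the available task slots.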