monicasenapati opened this issue 3 years ago
Why are you running Python 2? Just out of curiosity, since I haven't seen it in a while now. It has reached its end of life, and I don't think the latest XGBoost tracker can work with it.
For the error, it would be great if you could provide a reproducible example that I can run.
Hi @trivialfis, I am not really using Python directly. The tools I am working with are Spark, Hadoop, and Scala; Python 2.7 is simply what the cluster I am using comes with by default.
For the error, what kind of format would you prefer? This setup is on a CloudLab cluster under our assigned projects, and the files are very large, totaling more than 3 TB of data stored in HDFS.
Yes, but Python is part of the dependency chain, so some care is needed to make sure the correct Python interpreter is being used.
It would be great if you could debug it a little first: choose the right Python environment and reduce the amount of data needed to reproduce the error (see https://en.wikipedia.org/wiki/Minimal_working_example). It's unlikely anyone here will try to reproduce your environment and run 3 TB of data to help you debug the issue.
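For example, something along these lines (a rough Scala sketch; the paths and the sample fraction are placeholders, not from your setup) could carve out a small sample of the data to retry the job on:

import org.apache.spark.sql.SparkSession

// Rough sketch: write out a small, deterministic sample of the full dataset so
// the failure can be retried quickly. The paths and the fraction are placeholders.
val spark = SparkSession.builder().appName("xgb-mwe").getOrCreate()

val full = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///path/to/full_dataset.csv")

// Keep roughly 0.1% of the rows; adjust the fraction until the error still reproduces.
val sample = full.sample(withReplacement = false, fraction = 0.001, seed = 42L)

sample.write
  .mode("overwrite")
  .option("header", "true")
  .csv("hdfs:///path/to/mwe_sample")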
I see. Let me try that and get back to you. A quick bit of background: I recently switched to Spark 3. I was running the same code on a slightly smaller dataset with Python 2.7 and Spark 2.3 without issues; since the OS on the cluster was upgraded, I had to switch to these versions.
Could you please let me know which version of Python the tracker is compatible with?
Python 3.6.
Also, please consider checking the network environment, etc. The invalid magic number is usually caused by a sudden error in one of the workers.
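For example, a quick connectivity check along these lines (a sketch only; the host and port below are placeholders for the address the tracker logs) can be run from each worker node, e.g. in spark-shell:

import java.net.{InetSocketAddress, Socket}

// Rough sketch: run from a worker node to check whether the Rabit tracker's
// host/port is reachable. Host and port below are placeholders; use the address
// the tracker prints in its "start listen on ..." log line.
def canReach(host: String, port: Int, timeoutMs: Int = 5000): Boolean = {
  val socket = new Socket()
  try {
    socket.connect(new InetSocketAddress(host, port), timeoutMs)
    true
  } catch {
    case _: Exception => false
  } finally {
    socket.close()
  }
}

println(canReach("tracker-host.example", 9091)) // placeholder host/port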
Thanks for the confirmation @trivialfis. I updated Python on the nodes to 3.6, yet I continue to get the following error:
21/10/12 09:26:35 INFO RabitTracker$TrackerProcessLogger: 2021-10-12 11:26:35,965 INFO start listen on 130.127.133.143:9091
21/10/12 09:26:35 INFO XGBoostSpark: starting training with timeout set as 1800000 ms for waiting for resources
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger: Exception in thread Thread-1:
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger: Traceback (most recent call last):
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger: File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger: self.run()
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger: File "/usr/lib/python3.6/threading.py", line 864, in run
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger: self._target(*self._args, **self._kwargs)
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger: File "/tmp/tracker11295180445699876609.py", line 324, in run
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger: self.accept_slaves(nslave)
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger: File "/tmp/tracker11295180445699876609.py", line 268, in accept_slaves
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger: s = SlaveEntry(fd, s_addr)
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger: File "/tmp/tracker11295180445699876609.py", line 64, in __init__
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger: assert magic == kMagic, 'invalid magic number=%d from %s' % (magic, self.host)
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger: AssertionError: invalid magic number=542393671 from 185.173.35.17
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger:
21/10/12 17:48:45 INFO RabitTracker$TrackerProcessLogger: Tracker Process ends with exit code 0
21/10/12 17:48:45 INFO RabitTracker: Tracker Process ends with exit code 0
21/10/12 17:48:45 INFO XGBoostSpark: Rabit returns with exit code 0
I have also checked across the nodes that all the Spark workers are connected and running.
Digging further into the Spark worker logs, I found errors like the following:
21/10/12 08:45:34 INFO FileScanRDD: Reading File path: hdfs://vm0:9000/6mohe/containsLink/containsLink_ohe_4.csv, range: 31272730624-31406948352, partition values: [empty row]
21/10/12 08:46:35 WARN BlockReaderFactory: I/O error constructing remote block reader.
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.10.1.24:50010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3441)
at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:777)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:694)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:355)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:665)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:874)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:926)
at java.base/java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.fillBuffer(UncompressedSplitLineReader.java:62)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:94)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:116)
at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.<init>(HadoopFileLinesReader.scala:65)
It seems to be failing at the Hadoop layer? All the Hadoop nodes are healthy, and so are the files stored in HDFS.
I was initially running on a 16-node cluster, which worked well for two datasets. The current dataset is a scaled-up version of the first two and hence larger, so I moved to a 32-node cluster. I have been running Spark in standalone mode on the 32-node cluster, and I also tried running through YARN; the error persists. Each node in the cluster has 188 GB of memory and 40 cores (as I can see from the htop output).
I was using the following configuration for spark-submit:
--master yarn --conf spark.driver.memory=120G --num-executors 31 --executor-cores 2 --executor-memory 80G
with num_workers -> 62 in the params map of XGBoost. Could this be an issue? Or could it be tuned to make better use of the resources?
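For reference, here is a rough sketch of how those settings map onto the XGBoost4J-Spark estimator (the objective, number of rounds, and column names are placeholders, not my actual code):

// The spark-submit invocation described above, for reference:
//   spark-submit --master yarn --conf spark.driver.memory=120G \
//     --num-executors 31 --executor-cores 2 --executor-memory 80G ...

import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// num_workers = 62 matches the 31 executors x 2 cores above; the objective,
// number of rounds, and column names are placeholders, not the actual job.
val xgbParams = Map(
  "objective"   -> "binary:logistic",
  "num_round"   -> 100,
  "num_workers" -> 62
)

val classifier = new XGBoostClassifier(xgbParams)
  .setFeaturesCol("features")
  .setLabelCol("label")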
Hi! I wanted to confirm whether XGBoost supports Spark version 3.1.2. I have been trying to run XGBoost with the latest version of Apache Spark on a dataset larger than 3 TB on a 28-node cluster.
I have also been getting the following error and haven't been able to figure out what might be causing it.
I look forward to hearing back from you. Thank you in advance for your help and time.