Closed GenuineReader closed 1 year ago
The job expects as input a text file listing WARC/WAT/WET files (as path to a local file or S3 URL). According to the error message, looks like the job is reading a WAT file and without success tries to interpret every line as file name.
spark-3.3.2-bin-hadoop3/bin/spark-submit ./server_count. --num_output_partitions 1 --log_level WARN ./input/wat.gz servernames
23/02/18 09:20:39 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.0.41, 56238, None) 2023-02-18 09:20:52,155 INFO CountServers: Reading local file WARC/1.0 2023-02-18 09:20:52,156 ERROR CountServers: Failed to open /Users/joe/cc-pyspark/WARC/1.0: [Errno 2] No such file or directory: '/Users/joe/cc-pyspark/WARC/1.0' 2023-02-18 09:20:52,157 INFO CountServers: Reading local file WARC-Type: warcinfo 2023-02-18 09:20:52,158 ERROR CountServers: Failed to open /Users/joe/cc-pyspark/WARC-Type: warcinfo: [Errno 2] No such file or directory: '/Users/joe/cc-pyspark/WARC-Type: warcinfo' 2023-02-18 09:20:52,158 INFO CountServers: Reading local file WARC-Date: 2017-04-01T22:37:17Z 2023-02-18 09:20:52,159 ERROR CountServers: Failed to open /Users/joe/cc-pyspark/WARC-Date: 2017-04-01T22:37:17Z: [Errno 2] No such file or directory: '/Users/joe/cc-pyspark/WARC-Date: 2017-04-01T22:37:17Z' 2023-02-18 09:20:52,160 INFO CountServers: Reading local file WARC-Filename: CC-MAIN-20170322212946-00000-ip-10-233-31-227.ec2.internal.warc.gz 2023-02-18 09:20:52,161 ERROR CountServers: Failed to open /Users/joe/cc-pyspark/WARC-Filename: CC-MAIN-20170322212946-00000-ip-10-233-31-227.ec2.internal.warc.gz: [Errno 2] No such file or directory: '/Users/joe/cc-pyspark/WARC-Filename: CC-MAIN-20170322212946-00000-ip-10-233-31-227.ec2.internal.warc.gz' 2023-02-18 09:20:52,163 INFO CountServers: Reading local file WARC-Record-ID: