2023-05-10 08:56 INFO 259781:cc_net.jsonql - preparing [<cc_net.minify.MetadataFetcher object at 0x7f6b262a5d60>, <cc_net.jsonql.where object at 0x7f6b262a5b20>, <cc_net.jsonql.where object at 0x7f6b262a5d30>]
2023-05-10 08:56 INFO 259781:cc_net.jsonql - Opening /tmp/wet_2019-09.paths.gz with mode 'rt'
2023-05-10 08:56 INFO 259781:root - Starting download of https://data.commoncrawl.org/crawl-data/CC-MAIN-2019-09/segments/1550247479159.2/wet/CC-MAIN-20190215204316-20190215230316-00200.warc.wet.gz
/home/admin1/Documents/hieu/Code/ccnet/cc_net/cc_net/jsonql.py:1102: UserWarning: Swallowed error 503 Server Error: Service Unavailable for url: https://data.commoncrawl.org/crawl-data/CC-MAIN-2019-09/segments/1550247479159.2/wet/CC-MAIN-20190215204316-20190215230316-00200.warc.wet.gz while downloading https://data.commoncrawl.org/crawl-data/CC-MAIN-2019-09/segments/1550247479159.2/wet/CC-MAIN-20190215204316-20190215230316-00200.warc.wet.gz (1 out of 3)
warnings.warn(
/home/admin1/Documents/hieu/Code/ccnet/cc_net/cc_net/jsonql.py:1102: UserWarning: Swallowed error 503 Server Error: Service Unavailable for url: https://data.commoncrawl.org/crawl-data/CC-MAIN-2019-09/segments/1550247479159.2/wet/CC-MAIN-20190215204316-20190215230316-00200.warc.wet.gz while downloading https://data.commoncrawl.org/crawl-data/CC-MAIN-2019-09/segments/1550247479159.2/wet/CC-MAIN-20190215204316-20190215230316-00200.warc.wet.gz (2 out of 3)
warnings.warn(
2023-05-10 08:57 INFO 259781:split - Processed 0 documents in 0.017h ( 0.0 doc/s).
2023-05-10 08:57 INFO 259781:split - Found 0 splits.
2023-05-10 08:57 INFO 259781:MetadataFetcher - Processed 0 documents in 0.017h ( 0.0 doc/s).
2023-05-10 08:57 INFO 259781:MetadataFetcher - Read 0, stocking 0 doc in 0.1g.
2023-05-10 08:57 INFO 259781:where - Selected 0 documents out of 0 ( 0.0%)
2023-05-10 08:57 INFO 259781:where - Selected 0 documents out of 0 ( 0.0%)
submitit ERROR (2023-05-10 08:57:54,174) - Submitted job triggered an exception
2023-05-10 08:57 ERROR 259781:submitit - Submitted job triggered an exception
Traceback (most recent call last):
File "/home/admin1/miniconda3/envs/ccnetpy38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/admin1/miniconda3/envs/ccnetpy38/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/admin1/miniconda3/envs/ccnetpy38/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
submitit_main()
File "/home/admin1/miniconda3/envs/ccnetpy38/lib/python3.8/site-packages/submitit/core/submission.py", line 72, in submitit_main
process_job(args.folder)
File "/home/admin1/miniconda3/envs/ccnetpy38/lib/python3.8/site-packages/submitit/core/submission.py", line 65, in process_job
raise error
File "/home/admin1/miniconda3/envs/ccnetpy38/lib/python3.8/site-packages/submitit/core/submission.py", line 54, in process_job
result = delayed.result()
File "/home/admin1/miniconda3/envs/ccnetpy38/lib/python3.8/site-packages/submitit/core/utils.py", line 133, in result
self._result = self.function(*self.args, **self.kwargs)
File "/home/admin1/Documents/hieu/Code/ccnet/cc_net/cc_net/mine.py", line 432, in _mine_shard
jsonql.run_pipes(
File "/home/admin1/Documents/hieu/Code/ccnet/cc_net/cc_net/jsonql.py", line 455, in run_pipes
write_jsons(data, output)
File "/home/admin1/Documents/hieu/Code/ccnet/cc_net/cc_net/jsonql.py", line 496, in write_jsons
for res in source:
File "/home/admin1/Documents/hieu/Code/ccnet/cc_net/cc_net/jsonql.py", line 284, in map
for x in source:
File "/home/admin1/Documents/hieu/Code/ccnet/cc_net/cc_net/jsonql.py", line 277, in map
for x in source:
File "/home/admin1/Documents/hieu/Code/ccnet/cc_net/cc_net/process_wet_file.py", line 206, in __iter__
for doc in parse_warc_file(self.open_segment(segment), self.min_len):
File "/home/admin1/Documents/hieu/Code/ccnet/cc_net/cc_net/process_wet_file.py", line 199, in open_segment
return jsonql.open_remote_file(url, cache=file)
File "/home/admin1/Documents/hieu/Code/ccnet/cc_net/cc_net/jsonql.py", line 1124, in open_remote_file
raw_bytes = request_get_content(url)
File "/home/admin1/Documents/hieu/Code/ccnet/cc_net/cc_net/jsonql.py", line 1101, in request_get_content
raise e
File "/home/admin1/Documents/hieu/Code/ccnet/cc_net/cc_net/jsonql.py", line 1095, in request_get_content
r.raise_for_status()
File "/home/admin1/miniconda3/envs/ccnetpy38/lib/python3.8/site-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://data.commoncrawl.org/crawl-data/CC-MAIN-2019-09/segments/1550247479159.2/wet/CC-MAIN-20190215204316-20190215230316-00200.warc.wet.gz
When I execute:
python -m cc_net --dump 2019-13
The full log, ending with the error, is shown above.
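Note on the log above: the warnings show the download being attempted three times ("1 out of 3", "2 out of 3") before the `HTTPError` for the 503 is re-raised and the job fails. Below is a minimal sketch of the kind of retry with exponential backoff that might ride out a transient 503. It is not cc_net's actual code: `retry_with_backoff`, `fetch`, `n_retries`, `base_delay`, and `sleep` are hypothetical names used only for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fetch: Callable[[], T],
    n_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() up to n_retries times, doubling the wait after each failure.

    Hypothetical helper, not part of cc_net. `sleep` is injectable so the
    behaviour can be checked without actually waiting.
    """
    for attempt in range(1, n_retries + 1):
        try:
            return fetch()
        except Exception as e:
            if attempt == n_retries:
                # Out of attempts: surface the last error to the caller.
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt}/{n_retries} failed ({e}); retrying in {delay:.0f}s")
            sleep(delay)
    # Only reached when n_retries < 1, so fetch() was never called.
    raise RuntimeError("n_retries must be >= 1")
```

In practice `fetch` would wrap the actual download, for example a function that calls `requests.get(url)` and then `raise_for_status()` on the response. Whether a longer backoff is enough depends on how long the server keeps answering with 503.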