Closed zl827154659 closed 4 years ago
Did you try again? The error means that the file downloaded from https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-51/segments/1512948512208.1/wet/CC-MAIN-20171211052406-20171211072406-00300.warc.wet.gz isn't a valid gzip file. It looks like a silent network error.
In the dev branch you can choose to keep the downloaded files on disk so you can inspect them manually (by setting cache_dir).
Closing due to inactivity; I'm trying to clean up my backlog. Feel free to reopen if you observe a non-transient failure. Note that the dev branch I was mentioning is now merged into master.
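For anyone inspecting a cached file by hand, here is a minimal sketch of how one might check whether a downloaded .gz file is complete. It is not part of cc_net; the helper name is_valid_gzip is hypothetical, and it only relies on the standard gzip module, which raises EOFError on a truncated stream and OSError on a bad header.

```python
import gzip
import zlib


def is_valid_gzip(path, chunk_size=1 << 20):
    """Return True if the file at `path` decompresses cleanly to the end.

    Hypothetical helper, not a cc_net function. The file is read in chunks
    and the decompressed data is discarded, so memory use stays small.
    """
    try:
        with gzip.open(path, "rb") as f:
            while f.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # EOFError: stream ended before the end-of-stream marker (truncated download).
        # OSError / zlib.error: not a gzip file, or corrupted compressed data.
        return False
    return True
```

Calling is_valid_gzip on the path of a cached WET file should return False for a file that was cut off mid-download.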
Hi @gwenzek, I got the same error when running in test mode on my local computer, but at a later processing stage. Is it because I am not running on slurm, or because the bin file is not a valid file? Is there any way to fix this error? Thanks.
/data/cc_net/cc_net/flat_hash_set.py:116: UserWarning: Module 'getpy' not found. Deduplication will take more RAM. Try `pip install cc_net[getpy]
"Module 'getpy' not found. Deduplication will take more RAM."
2023-02-01 14:54 INFO 27658:cc_net.jsonql - Opening test_data/wet_cache/2022-49/wet_2022-49.paths.gz with mode 'rt'
2023-02-01 14:54 INFO 27658:cc_net.jsonql - Opening test_data/wet_cache/2022-49/CC-MAIN-20221203075717-20221203105717-00000.warc.wet.gz with mode 'rt'
2023-02-01 14:54 INFO 27658:HashesCollector - Saved 27105 hashes to test_data/hashes/2022-49/0002.tmp.bin
2023-02-01 14:54 INFO 27658:HashesCollector - Processed 175 documents in 0.00058h ( 84.3 doc/s).
2023-02-01 14:54 INFO 27658:HashesCollector - Found 27k unique hashes over 40k lines. Using 0.1GB of RAM.
submitit ERROR (2023-02-01 14:54:04,589) - Submitted job triggered an exception
2023-02-01 14:54 ERROR 27658:submitit - Submitted job triggered an exception
Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/sonla/.local/lib/python3.7/site-packages/submitit/core/_submit.py", line 11, in <module>
submitit_main()
File "/home/sonla/.local/lib/python3.7/site-packages/submitit/core/submission.py", line 72, in submitit_main
process_job(args.folder)
File "/home/sonla/.local/lib/python3.7/site-packages/submitit/core/submission.py", line 65, in process_job
raise error
File "/home/sonla/.local/lib/python3.7/site-packages/submitit/core/submission.py", line 54, in process_job
result = delayed.result()
File "/home/sonla/.local/lib/python3.7/site-packages/submitit/core/utils.py", line 133, in result
self._result = self.function(*self.args, **self.kwargs)
File "/data/cc_net/cc_net/mine.py", line 276, in _hashes_shard
inputs=conf.get_cc_shard(shard),
File "/data/cc_net/cc_net/jsonql.py", line 455, in run_pipes
write_jsons(data, output)
File "/data/cc_net/cc_net/jsonql.py", line 496, in write_jsons
for res in source:
File "/data/cc_net/cc_net/jsonql.py", line 284, in map
for x in source:
File "/data/cc_net/cc_net/process_wet_file.py", line 216, in __iter__
for doc in parse_warc_file(self.open_segment(segment), self.min_len):
File "/data/cc_net/cc_net/process_wet_file.py", line 149, in parse_warc_file
for doc in group_by_docs(lines):
File "/data/cc_net/cc_net/process_wet_file.py", line 122, in group_by_docs
for warc in warc_lines:
File "/data/cc_net/cc_net/jsonql.py", line 971, in _close_when_exhausted
yield from file
File "/usr/lib/python3.7/gzip.py", line 289, in read1
return self._buffer.read1(size)
File "/usr/lib/python3.7/_compression.py", line 68, in readinto
data = self.read(len(byte_view))
File "/usr/lib/python3.7/gzip.py", line 482, in read
raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
@datquocnguyen I have the same problem. The WET file (test_data/wet_cache/2022-49/CC-MAIN-20221203075717-20221203105717-00000.warc.wet.gz) is corrupted; you need to delete it and run the program again.
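If several cached files might be affected, the delete-and-rerun step could be scripted. The following is only a sketch under assumptions: remove_corrupt_gzips is a hypothetical helper (not a cc_net command), and the cache directory path is taken from the log above. It defaults to a dry run so nothing is deleted until you have reviewed the list.

```python
import gzip
import zlib
from pathlib import Path


def remove_corrupt_gzips(cache_dir, pattern="*.gz", dry_run=True):
    """List (and optionally delete) .gz files under `cache_dir` that fail to decompress.

    Hypothetical helper, not part of cc_net. Returns the list of corrupt paths.
    """
    corrupt = []
    for path in sorted(Path(cache_dir).rglob(pattern)):
        try:
            with gzip.open(path, "rb") as f:
                # Read to the end in 1 MiB chunks; a truncated file raises EOFError.
                while f.read(1 << 20):
                    pass
        except (EOFError, OSError, zlib.error):
            corrupt.append(path)
            if not dry_run:
                path.unlink()  # delete so the next run downloads the file again
    return corrupt


if __name__ == "__main__":
    # Assumed cache location, copied from the log in this thread.
    for p in remove_corrupt_gzips("test_data/wet_cache", dry_run=True):
        print("corrupt:", p)
```

After checking the printed list, rerunning with dry_run=False removes those files, and the pipeline should fetch them again on the next run.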
Hi there, I was trying to run the code with MPExecutor but got the following error:
and the config I've changed in the mine.py is just like this:
I searched for this error, and everything I found says it is caused by an incompletely downloaded file. But I saw the docstring of the function open_remote_file in jsonql.py: "Download the files at the given url to memory and opens it as a file". How can I delete these incompletely downloaded files from memory? Or is there any other way to fix this error? By the way, I am running the code in an Ubuntu 20.04 Docker container.