facebookresearch / cc_net

Tools to download and cleanup Common Crawl data

EOFError: Compressed file ended before the end-of-stream marker was reached #11

Closed · zl827154659 closed this issue 4 years ago

zl827154659 commented 4 years ago

Hi there, I was trying to run the code with the MPExecutor but got the following error:

2020-07-23 20:44 INFO 156:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-51/wet.paths.gz [200]
2020-07-23 20:44 INFO 156:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-51/segments/1512948512054.0/wet/CC-MAIN-20171211014442-20171211034442-00400.warc.wet.gz
2020-07-23 20:48 INFO 171:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-51/segments/1512948512208.1/wet/CC-MAIN-20171211052406-20171211072406-00300.warc.wet.gz [200]
2020-07-23 20:48 INFO 171:HashesCollector - Processed 2_915 documents in 0.078h ( 10.4 doc/s).
2020-07-23 20:48 INFO 171:HashesCollector - Found 0k unique hashes over 522 lines. Using 0.1GB of RAM.
multiprocessing.pool.RemoteTraceback: 

Traceback (most recent call last):
  File "/home/app/anaconda3/envs/ccnet/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/cc_net-master/cc_net/execution.py", line 145, in global_fn
    return f(*args[1:])
  File "/home/cc_net-master/cc_net/mine.py", line 233, in _hashes_shard
    file=conf.get_cc_shard(shard),
  File "/home/cc_net-master/cc_net/jsonql.py", line 449, in run_pipes
    for res in results:
  File "/home/cc_net-master/cc_net/jsonql.py", line 296, in map
    for x in source:
  File "/home/cc_net-master/cc_net/process_wet_file.py", line 199, in __iter__
    for doc in parse_warc_file(iter(f), self.min_len):
  File "/home/cc_net-master/cc_net/process_wet_file.py", line 117, in parse_warc_file
    for doc in group_by_docs(lines):
  File "/home/cc_net-master/cc_net/process_wet_file.py", line 89, in group_by_docs
    for warc in warc_lines:
  File "/home/app/anaconda3/envs/ccnet/lib/python3.7/gzip.py", line 300, in read1
    return self._buffer.read1(size)
  File "/home/app/anaconda3/envs/ccnet/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/home/app/anaconda3/envs/ccnet/lib/python3.7/gzip.py", line 493, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/app/anaconda3/envs/ccnet/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/app/anaconda3/envs/ccnet/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/cc_net-master/cc_net/__main__.py", line 24, in <module>
    main()
  File "/home/cc_net-master/cc_net/__main__.py", line 20, in main
    func_argparse.parse_and_call(parser)
  File "/home/app/anaconda3/envs/ccnet/lib/python3.7/site-packages/func_argparse/__init__.py", line 72, in parse_and_call
    return command(**parsed_args)
  File "/home/cc_net-master/cc_net/mine.py", line 524, in main
    regroup(conf)
  File "/home/cc_net-master/cc_net/mine.py", line 379, in regroup
    mine(conf)
  File "/home/cc_net-master/cc_net/mine.py", line 272, in mine
    hashes_groups = list(jsonql.grouper(hashes(conf), conf.hash_in_mem))
  File "/home/cc_net-master/cc_net/mine.py", line 221, in hashes
    ex(_hashes_shard, repeat(conf), *_transpose(missing_outputs))
  File "/home/cc_net-master/cc_net/execution.py", line 174, in __call__
    global_fn, zip(itertools.repeat(f_name), *args)
  File "/home/app/anaconda3/envs/ccnet/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
EOFError: Compressed file ended before the end-of-stream marker was reached

The config I changed in mine.py is as follows:

    config_name: str = "base"
    dump: str = "2017-51"
    output_dir: Path = Path("data")
    execution: str = "mp"
    num_shards: int = 800
    num_segments_per_shard: int = -1
    min_len: int = 300
    hash_in_mem: int = 25
    lang_whitelist: Sequence[str] = ["zh"]
    lang_blacklist: Sequence[str] = []
    lang_threshold: float = 0.5
    lm_dir: Path = Path("data/lm_sp")
    cutoff: Path = CUTOFF_CSV
    lm_languages: Optional[Sequence[str]] = ["zh"]
    mine_num_processes: int = 10
    target_size: str = "2G"
    cleanup_after_regroup: bool = True
    task_parallelism: int = 500
    pipeline: Sequence[str] = []
    experiments: Sequence[str] = []

I searched for this error and the answers all say it is caused by an incompletely downloaded file. But I saw the docstring of open_remote_file in jsonql.py: "Download the files at the given url to memory and opens it as a file". How can I delete these incomplete downloads from memory? Or is there any other way to fix this error?

By the way, the environment I was running the code in is a Docker container with Ubuntu 20.04.

gwenzek commented 4 years ago

Did you try again? The error means that the file downloaded from https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-51/segments/1512948512208.1/wet/CC-MAIN-20171211052406-20171211072406-00300.warc.wet.gz isn't a valid gzip file. It looks like a silent network error.

In the dev branch you can choose to keep the downloaded files on disk so you can inspect them manually (by setting cache_dir).
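
Once the files are kept on disk, a quick way to tell whether a cached segment is complete is to decompress it end to end; a truncated download raises the same EOFError. A minimal sketch, using only the standard library (the cache path below is hypothetical; adjust it to your cache_dir):

import gzip

def is_complete_gzip(path: str) -> bool:
    """Return True if the gzip stream decompresses all the way to its end-of-stream marker."""
    try:
        with gzip.open(path, "rb") as f:
            while f.read(1 << 20):  # read 1 MiB chunks until EOF
                pass
        return True
    except (EOFError, OSError):  # truncated stream, or not a gzip file at all
        return False

# Hypothetical cached segment path; point this at a file under your cache_dir.
print(is_complete_gzip("data/wet_cache/CC-MAIN-20171211052406-20171211072406-00300.warc.wet.gz"))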

gwenzek commented 4 years ago

Closing since there has been no activity; I'm trying to clean up my backlog. Feel free to reopen if you observe a non-transient failure. Note that the dev branch I mentioned is now merged into master.

datquocnguyen commented 1 year ago

Hi @gwenzek, I got the same error when running in test mode on my local computer, but at a later processing stage. Is it because I am not running on Slurm, or because the bin file is not valid? Any solution to fix this error? Thanks.

/data/cc_net/cc_net/flat_hash_set.py:116: UserWarning: Module 'getpy' not found. Deduplication will take more RAM. Try `pip install cc_net[getpy]
  "Module 'getpy' not found. Deduplication will take more RAM."
2023-02-01 14:54 INFO 27658:cc_net.jsonql - Opening test_data/wet_cache/2022-49/wet_2022-49.paths.gz with mode 'rt'
2023-02-01 14:54 INFO 27658:cc_net.jsonql - Opening test_data/wet_cache/2022-49/CC-MAIN-20221203075717-20221203105717-00000.warc.wet.gz with mode 'rt'
2023-02-01 14:54 INFO 27658:HashesCollector - Saved 27105 hashes to test_data/hashes/2022-49/0002.tmp.bin
2023-02-01 14:54 INFO 27658:HashesCollector - Processed 175 documents in 0.00058h ( 84.3 doc/s).
2023-02-01 14:54 INFO 27658:HashesCollector - Found 27k unique hashes over 40k lines. Using 0.1GB of RAM.
submitit ERROR (2023-02-01 14:54:04,589) - Submitted job triggered an exception
2023-02-01 14:54 ERROR 27658:submitit - Submitted job triggered an exception
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/sonla/.local/lib/python3.7/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/home/sonla/.local/lib/python3.7/site-packages/submitit/core/submission.py", line 72, in submitit_main
    process_job(args.folder)
  File "/home/sonla/.local/lib/python3.7/site-packages/submitit/core/submission.py", line 65, in process_job
    raise error
  File "/home/sonla/.local/lib/python3.7/site-packages/submitit/core/submission.py", line 54, in process_job
    result = delayed.result()
  File "/home/sonla/.local/lib/python3.7/site-packages/submitit/core/utils.py", line 133, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "/data/cc_net/cc_net/mine.py", line 276, in _hashes_shard
    inputs=conf.get_cc_shard(shard),
  File "/data/cc_net/cc_net/jsonql.py", line 455, in run_pipes
    write_jsons(data, output)
  File "/data/cc_net/cc_net/jsonql.py", line 496, in write_jsons
    for res in source:
  File "/data/cc_net/cc_net/jsonql.py", line 284, in map
    for x in source:
  File "/data/cc_net/cc_net/process_wet_file.py", line 216, in __iter__
    for doc in parse_warc_file(self.open_segment(segment), self.min_len):
  File "/data/cc_net/cc_net/process_wet_file.py", line 149, in parse_warc_file
    for doc in group_by_docs(lines):
  File "/data/cc_net/cc_net/process_wet_file.py", line 122, in group_by_docs
    for warc in warc_lines:
  File "/data/cc_net/cc_net/jsonql.py", line 971, in _close_when_exhausted
    yield from file
  File "/usr/lib/python3.7/gzip.py", line 289, in read1
    return self._buffer.read1(size)
  File "/usr/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/usr/lib/python3.7/gzip.py", line 482, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

yanzhouyoung commented 1 year ago

@datquocnguyen I have the same problem. The wet file (test_data/wet_cache/2022-49/CC-MAIN-20221203075717-20221203105717-00000.warc.wet.gz) is corrupted; you need to delete it and run the program again.
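
If more than one cached file may be affected, a sketch like the following scans the cache and deletes any truncated segments so they are re-downloaded on the next run (the directory is taken from the log above; adjust it to your setup):

import gzip
from pathlib import Path

cache_dir = Path("test_data/wet_cache")  # cache directory from the log above; adjust as needed

for path in sorted(cache_dir.glob("**/*.warc.wet.gz")):
    try:
        with gzip.open(path, "rb") as f:
            while f.read(1 << 20):  # decompress in 1 MiB chunks until EOF
                pass
    except (EOFError, OSError):  # truncated or otherwise invalid gzip stream
        print(f"deleting corrupted segment: {path}")
        path.unlink()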