Btw, there are some errors/typos in the README: the dump_id parameter is not found, so I changed it to dump.
Thanks for opening the bug.
I'm not sure what produces "connection reset by peer". In my experience it only happened when I was downloading CC from ~500 processes, so I think it was just some kind of AWS rate limit. If you look at the code, I sleep and retry 3 times when it happens. You can try increasing this setting by modifying the code.
But it looks like you were running it in a single process, so I'm not sure.
I did all of my runs from a datacenter based in California; let me know if using a datacenter in the USA helps. The "Estimated remaining time" from your logs is also quite high: 9 hours, while it was under one hour for me. Your connection may be the issue.
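For reference, the retry behaviour I'm referring to is roughly of this shape (a minimal sketch, not the actual cc_net code; the function name and sleep schedule are illustrative, only the idea of an n_retry knob carries over):

```python
# Illustrative retry loop for transient network errors such as
# "connection reset by peer" (ConnectionResetError is a subclass of OSError).
# Bumping n_retry is the kind of change meant by "increase this setting".
import time
import urllib.request


def download_with_retry(url: str, n_retry: int = 3) -> bytes:
    for attempt in range(n_retry):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except OSError as e:
            if attempt == n_retry - 1:
                raise  # give up after the last attempt
            wait = 10 * (attempt + 1)
            print(f"{e!r} while fetching {url}, retrying in {wait}s ({attempt + 1}/{n_retry})")
            time.sleep(wait)
    raise AssertionError("unreachable")
```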
If you intend to run the full pipeline (python -m cc_net mine), then I suggest you get gcc 7 and compile the binary for getpy. Currently it needs (from memory; more details in the paper) 75 GB of RAM per process, with 40 GB being the hash set. If you use a standard Python dict it will use around 100 GB just for the hash set.
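To give an intuition for that gap, here is a rough pure-Python illustration (not the cc_net code; the real getpy structure stores more per entry than a bare array, so the actual gap is smaller than this toy comparison suggests):

```python
# Rough illustration of per-hash memory cost: flat uint64 storage vs a Python set.
# Numbers are approximate and only meant to show the order-of-magnitude difference.
import sys
import numpy as np

n = 1_000_000
hashes = np.random.default_rng(0).integers(0, 2**63, size=n, dtype=np.uint64)

flat_bytes = hashes.nbytes                    # 8 bytes per 64-bit hash
py_set = set(hashes.tolist())
set_bytes = sys.getsizeof(py_set) + 32 * n    # pointer/hash table + ~32 bytes per boxed int

print(f"flat array : {flat_bytes / n:5.1f} bytes/hash")
print(f"python set : {set_bytes / n:5.1f} bytes/hash (approximate)")
```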
I also suggest you look into distributing the workload. I don't have much experience with AWS, so I'm interested in learning if there are things I should change to make the code easier to run on AWS.
If you just want the corpus with the text, what you need is python -m cc_net reproduce --dump 2019-09, which doesn't require the big hash sets.
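For what it's worth, each output file is gzipped JSON lines (one document per line), so you can inspect a shard with something like this (the path below is just an example):

```python
# Peek at a few documents from one reproduced shard.
# The file name is only an example; adjust it to whatever reproduce produced.
import gzip
import json

path = "data/reconstruct/2019-09/ar_head_0001.json.gz"
with gzip.open(path, "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        doc = json.loads(line)
        print(sorted(doc.keys()))      # fields such as url, language, raw_content, ...
        print(doc.get("url"))
        if i >= 2:
            break
```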
It's very kind of you! I guess this was due to a poor network connection, so I switched to AWS.
I don't want to deal with the C++17 issue, so I abandoned the full pipeline and decided to run the reproduction procedure with python -m cc_net reproduce --dump 2019-09, as you suggested.
For some complicated reasons, I set up my AWS server in Singapore instead of the US. The server instance has 8 CPUs, 64 GB of memory and a 4 TB hard disk. Is this configuration good enough to finish the reproduction? If so, how long will it take in your opinion?
After ~20 hours of running, I have the following in the reconstruct folder:
ubuntu@<my-hostname>:/data/cc_net/data/reconstruct/2019-09$ ll
total 169704
drwxr-xr-x 3 root root 4096 Nov 13 02:27 ./
drwxr-xr-x 3 root root 4096 Nov 12 10:50 ../
-rw-r--r-- 1 root root 8078541 Nov 12 12:46 ar_head_0001.json.gz
-rw-r--r-- 1 root root 1701686 Nov 13 07:27 tmp.af_head_0000.json.gz
-rw-r--r-- 1 root root 50175486 Nov 13 07:28 tmp.ar_head_0000.json.gz
-rw-r--r-- 1 root root 136 Nov 12 16:03 tmp.ar_head_0000.json.gz.index
-rw-r--r-- 1 root root 136 Nov 12 12:46 tmp.ar_head_0001.json.gz.index
-rw-r--r-- 1 root root 2804071 Nov 13 07:28 tmp.az_head_0000.json.gz
-rw-r--r-- 1 root root 2688051 Nov 13 07:28 tmp.be_head_0000.json.gz
-rw-r--r-- 1 root root 22772810 Nov 13 07:28 tmp.bg_head_0000.json.gz
-rw-r--r-- 1 root root 136 Nov 13 02:27 tmp.bg_head_0000.json.gz.index
-rw-r--r-- 1 root root 13951603 Nov 13 07:28 tmp.bn_head_0000.json.gz
-rw-r--r-- 1 root root 11034433 Nov 13 07:28 tmp.ca_head_0000.json.gz
-rw-r--r-- 1 root root 136 Nov 12 11:52 tmp.ca_head_0000.json.gz.index
-rw-r--r-- 1 root root 60328129 Nov 13 07:28 tmp.cs_head_0000.json.gz
-rw-r--r-- 1 root root 152 Nov 12 20:35 tmp.cs_head_0000.json.gz.index
drwxr-xr-x 2 root root 155648 Nov 13 07:28 wet_cache/
and 1585 files like CC-MAIN-20190224044113-20190224070113-00599.warc.wet in wet_cache, which consume ~200 GB of disk space.
Is this normal? There are only ~10 *_head_*.json.gz files and the program seems to have stopped downloading and moved on to the corpus cleaning, but IIRC the paper says it'll generate a 3.2 TB corpus covering ~100 languages. How could it generate such a large clean corpus from a much smaller raw corpus?
Another issue I have to mention: occasionally the program fails (sorry, I can't find the error log right now), so I added a crontab task to detect the failure and restart it with the original command python -m cc_net reproduce --dump 2019-09. Is it OK to simply restart it? Will it continue the previous computation or start from scratch?
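The crontab task is essentially doing this (sketched here as a plain Python loop rather than the actual cron entry; timings are simplified):

```python
# Watchdog sketch: re-run the reproduce command whenever the previous run
# exits with an error, and stop once it finishes cleanly.
import subprocess
import time

CMD = ["python", "-m", "cc_net", "reproduce", "--dump", "2019-09"]

while True:
    returncode = subprocess.run(CMD).returncode
    if returncode == 0:
        break  # finished cleanly
    print(f"cc_net exited with code {returncode}, restarting in 60s")
    time.sleep(60)
```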
Looking forward to your reply!
the server instance has 8 CPUs, 64 GB of memory and a 4 TB hard disk
This seems enough, but the 8 cores will make it slow. I'd say 4 days if everything goes fine; using 160 CPUs it runs in under 4 hours. The first thing you can do is prioritize languages. Do you want all languages or just a subset?
The tmp. files are quite small: 50 MB for Arabic, while it should be around 4 GB.
So I think the process stopped very early.
For the crashing, I'd bet on out-of-memory. Can you try running with --parallelism=1 and restricting to one language?
If you have logs, can you share lines saying something like:
2019-11-13 07:08 INFO 72609:Unminifier - Processed 8 documents in 0.016h ( 0.1 doc/s).
2019-11-13 07:08 INFO 72609:Unminifier - Read 263_924, stocking 5 doc in 0.1Gb.
I observed that the memory usage can climb a bit high for languages with a lot of documents because I'm doing some kind of buffering. If this is the issue, I'll look into reducing the RAM usage.
Restarting will drop all existing tmp files, but all files without the tmp prefix will be kept. Also, all documents in wet_cache are kept, so you won't need to download them again.
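Concretely, the reason restarting is safe is the usual write-to-tmp-then-rename pattern (a sketch of the idea, not the exact cc_net code): a shard only gets its final name once it is complete, so an interrupted run can only leave tmp files behind, and completed shards can be skipped on restart.

```python
# Sketch of the tmp-then-rename pattern behind the restart behaviour.
import gzip
from pathlib import Path
from typing import Iterable


def write_shard(final: Path, lines: Iterable[str]) -> None:
    tmp = final.with_name("tmp." + final.name)   # e.g. tmp.ar_head_0000.json.gz
    with gzip.open(tmp, "wt", encoding="utf-8") as o:
        for line in lines:
            o.write(line + "\n")
    tmp.rename(final)                            # only complete shards get their final name


def already_done(final: Path) -> bool:
    return final.exists()                        # on restart, completed shards are skipped
```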
The wet_cache, which holds the original corpus, is made of *.gz files, while the numbers reported in the paper are for uncompressed data.
Update:
As you mentioned above, this program heavily uses CPUs, so now I'm working with a 40-core server with 64 GB of memory. Because it is located in China, I changed n_retry to 30 to let it recover from the unstable network connection. Waiting for good news.
The following section continues the previous post, but I have since abandoned this approach. I'll keep it as a reference for others who want to reproduce this work.
=======================================================
Thanks for your feedback.
Do you want all languages or just a subset?
Yes, I want the whole corpus including all languages.
I checked /var/log/syslog and found no memory issues or kill signals, so the operating system didn't kill it. Finally, I was able to find the error in the program's log:
2019-11-13 22:11 INFO 9578:Unminifier - ! Missed 14 documents (0.1%) !
2019-11-13 22:11 INFO 9578:Unminifier - ! Missed 120 paragraphs (0.0%) !
2019-11-13 22:11 INFO 9572:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247481428.19/wet/CC-MAIN-20190217010854-20190217032854-00303.warc.wet.gz [200]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/data/cc_net/cc_net/execution.py", line 145, in global_fn
return f(*args[1:])
File "/data/cc_net/cc_net/minify.py", line 289, in unminify_file
jsonql.run_pipes(unminifier, file=iter(mini), output=tmp)
File "/data/cc_net/cc_net/jsonql.py", line 448, in run_pipes
for res in results:
File "/data/cc_net/cc_net/jsonql.py", line 292, in map
yield self(x)
File "/data/cc_net/cc_net/jsonql.py", line 259, in __call__
y = self.do(x)
File "/data/cc_net/cc_net/jsonql.py", line 358, in do
x = t(x)
File "/data/cc_net/cc_net/jsonql.py", line 259, in __call__
y = self.do(x)
File "/data/cc_net/cc_net/minify.py", line 193, in do
full_doc = self.retrieve_doc(segment, digest)
File "/data/cc_net/cc_net/minify.py", line 178, in retrieve_doc
with self.open_segment(segment) as f:
File "/data/cc_net/cc_net/minify.py", line 159, in open_segment
tmp.unlink()
File "/usr/lib/python3.7/pathlib.py", line 1294, in unlink
self._accessor.unlink(self)
FileNotFoundError: [Errno 2] No such file or directory: '/data/cc_net/data/reconstruct/2019-09/wet_cache/tmp_34c08f87.CC-MAIN-20190215183319-20190215205319-00491.warc.wet.gz'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/data/cc_net/cc_net/__main__.py", line 31, in <module>
main()
File "/data/cc_net/cc_net/__main__.py", line 27, in main
command(**parsed_args)
File "/data/cc_net/cc_net/minify.py", line 368, in reproduce
unminify(urls, output_dir / dump, execution, parallelism, cache_dir)
File "/data/cc_net/cc_net/minify.py", line 329, in unminify
ex(unminify_file, files, outputs, itertools.repeat(cache_dir))
File "/data/cc_net/cc_net/execution.py", line 174, in __call__
global_fn, zip(itertools.repeat(f_name), *args)
File "/usr/lib/python3.7/multiprocessing/pool.py", line 748, in next
raise value
FileNotFoundError: [Errno 2] No such file or directory: '/data/cc_net/data/reconstruct/2019-09/wet_cache/tmp_34c08f87.CC-MAIN-20190215183319-20190215205319-00491.warc.wet.gz'
Here's my current memory footprint:
ubuntu@<my-hostname>:/data/cc_net$ top
top - 03:19:36 up 1 day, 20:00, 1 user, load average: 8.00, 8.00, 8.00
Tasks: 174 total, 9 running, 165 sleeping, 0 stopped, 0 zombie
%Cpu(s): 99.2 us, 0.7 sy, 0.0 ni, 0.0 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 65332224 total, 356440 free, 20029108 used, 44946676 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 44719176 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14522 root 20 0 632756 113200 5368 R 100.0 0.2 306:00.58 python
14523 root 20 0 6807276 5.995g 5436 R 100.0 9.6 306:45.40 python
14526 root 20 0 2641508 2.023g 5432 R 100.0 3.2 307:10.08 python
14525 root 20 0 728648 209500 5776 R 99.7 0.3 305:58.95 python
14528 root 20 0 2168852 1.572g 5432 R 99.7 2.5 307:02.03 python
14529 root 20 0 7868240 7.006g 5436 R 99.7 11.2 306:36.43 python
14527 root 20 0 2272288 1.671g 5432 R 99.3 2.7 307:14.14 python
14524 root 20 0 993484 473840 5432 R 99.0 0.7 306:18.47 python
63 root 20 0 0 0 0 S 0.3 0.0 4:27.24 kswapd0
1 root 20 0 37964 5232 3228 S 0.0 0.0 0:05.33 systemd
Here's some of the log right before the crash, as you requested:
2019-11-13 22:10 INFO 9575:cc_net.process_wet_file - Kept 43_893 documents over 44_765 (98.1%).
2019-11-13 22:10 INFO 9575:Unminifier - Processed 2_689 documents in 1.2e+01h ( 0.1 doc/s).
2019-11-13 22:10 INFO 9575:Unminifier - Read 69_062_621, stocking 1_094 doc in 0.5Gb.
2019-11-13 22:10 INFO 9575:Unminifier - ! Missed 11 documents (0.4%) !
2019-11-13 22:10 INFO 9575:Unminifier - ! Missed 274 paragraphs (0.1%) !
2019-11-13 22:10 INFO 9575:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247479627.17/wet/CC-MAIN-20190215224408-20190216010408-00088.warc.wet.gz
2019-11-13 22:10 INFO 9573:cc_net.process_wet_file - Kept 43_502 documents over 44_286 (98.2%).
2019-11-13 22:10 INFO 9573:Unminifier - Processed 90_076 documents in 1.2e+01h ( 2.1 doc/s).
2019-11-13 22:10 INFO 9573:Unminifier - Read 84_485_662, stocking 54_562 doc in 7e+00Gb.
2019-11-13 22:10 INFO 9573:Unminifier - ! Missed 21 documents (0.0%) !
2019-11-13 22:10 INFO 9573:Unminifier - ! Missed 291 paragraphs (0.0%) !
2019-11-13 22:10 INFO 9573:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/wet/CC-MAIN-20190215183319-20190215205319-00491.warc.wet.gz
2019-11-13 22:10 INFO 9574:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/wet/CC-MAIN-20190215183319-20190215205319-00491.warc.wet.gz [200]
2019-11-13 22:11 INFO 9579:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/wet/CC-MAIN-20190215183319-20190215205319-00491.warc.wet.gz [200]
2019-11-13 22:11 INFO 9576:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/wet/CC-MAIN-20190215183319-20190215205319-00491.warc.wet.gz [200]
2019-11-13 22:11 INFO 9578:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/wet/CC-MAIN-20190215183319-20190215205319-00491.warc.wet.gz [200]
2019-11-13 22:11 INFO 9577:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/wet/CC-MAIN-20190215183319-20190215205319-00491.warc.wet.gz [200]
2019-11-13 22:11 INFO 9578:JsonReader - Processed 23_367 documents in 1.2e+01h ( 0.6 doc/s).
2019-11-13 22:11 INFO 9578:Unminifier - Processed 23_366 documents in 1.2e+01h ( 0.6 doc/s).
2019-11-13 22:11 INFO 9578:Unminifier - Read 84_485_662, stocking 14_523 doc in 2e+00Gb.
2019-11-13 22:11 INFO 9578:Unminifier - ! Missed 14 documents (0.1%) !
2019-11-13 22:11 INFO 9578:Unminifier - ! Missed 120 paragraphs (0.0%) !
2019-11-13 22:11 INFO 9572:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247481428.19/wet/CC-MAIN-20190217010854-20190217032854-00303.warc.wet.gz [200]
(followed by the same FileNotFoundError traceback as shown above)
So I guess the memory footprint is fine, but did I miss some raw corpus during the downloading process? Is this related to restarting?
Now I have ~3k files in the wet_cache folder:
ubuntu@<my-hostname>:/data/cc_net$ ll data/reconstruct/2019-09/wet_cache | wc -l
3068
Is it correct? I remember the paper says each dump is divided into 1600 shards.
File "/data/cc_net/cc_net/minify.py", line 178, in retrieve_doc
with self.open_segment(segment) as f:
File "/data/cc_net/cc_net/minify.py", line 159, in open_segment
tmp.unlink()
File "/usr/lib/python3.7/pathlib.py", line 1294, in unlink
self._accessor.unlink(self)
FileNotFoundError: [Errno 2] No such file or directory: '/data/cc_net/data/reconstruct/2019-09/wet_cache/tmp_34c08f87.CC-MAIN-20190215183319-20190215205319-00491.warc.wet.gz'
I think you caught a subtle race condition. I've pushed 07e66aa6 to fix this. Sorry for the bug. Can you try with this commit?
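The failure mode is several workers touching the same cached segment at once, so any cleanup of a temporary file has to tolerate it already being gone. Here is a sketch of the kind of defensive handling involved (illustrative only, not the actual content of 07e66aa6):

```python
# Race-tolerant caching sketch: per-worker temp names, atomic rename,
# and cleanup that never assumes the temp file still exists.
import os
import uuid
from pathlib import Path


def cache_segment(cache_dir: Path, name: str, data: bytes) -> Path:
    final = cache_dir / name
    if not final.exists():                       # another worker may have cached it already
        tmp = cache_dir / f"tmp_{uuid.uuid4().hex[:8]}.{name}"
        tmp.write_bytes(data)
        os.replace(tmp, final)                   # atomic rename; last writer wins
    return final


def safe_unlink(path: Path) -> None:
    try:
        path.unlink()
    except FileNotFoundError:                    # another worker (or a rename) got there first
        pass
```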
Now I have ~3k files in the wet_cache folder. Is it correct?
CommonCrawl is released in 64k "segments" (numbers may vary from one dump to another). In the paper we group those segments by 40 to get 1600 "shards" of around 4 GB; this makes computation more efficient since we have big models to load for each process.
The corpus we released is also split into shards of 4 GB, but those are grouped by language and quality, not by their original segment.
So wet_cache should have 64k files at the end.
Wow, lucky me! It's actually not that rare: I encounter it 3 to 4 times a day. I'll try the new code.
I'm still dealing with some network issues (latency, bandwidth, etc.), so no progress yet. Btw, I checked the CommonCrawl website and found the total size of the wet files in the 2019-09 dump is 7.62 TB. If the program caches all of them on the local disk, a 4 TB hard disk seems inadequate. So what's the minimum hard disk requirement? Maybe something like 15 TB (7.6 TB raw wet files + 3.2 TB clean corpus + some safety margin)?
@soloice I also faced a similar problem; my solution is to use gcsfuse (the AWS equivalent would be s3fs) to mount remote cloud storage for all the wet files, while keeping the tmp files and clean corpus on a regular mounted disk.
My current setup mounts a Google Cloud Storage bucket as the wet_cache. I think you can do the same with AWS S3, mounting it with s3fs. I also recommend downloading the wet files from a server within the Common Crawl region (the US, I suppose?), as the download speed is much faster, which reduces processes sitting idle waiting for downloads to finish.
Have you faced any memory errors when processing large corpora such as English or German? 64 GB of memory doesn't seem enough to me. I ended up adding a large swap and had to restart my whole process again.
@theblackcat102 Hi, I didn't encounter any memory issues. This might be due to the short continuous running time (I have to restart my process every 4-12 hours because of network connection, file opening, and other errors).
I agree with you that the best practice is to run the program (both computation and storage) within the US.
Reproducing this work is just a side project for me (I'm helping a data analyst colleague with it), and I have something more important to do now, so I won't be spending more time on it for the next month or so. I'll probably come back to it after that.
Did you manage to run the code in the end? If you have some more tips you can share with future users of the repository, that would be great, and we could add them to the README.
Thanks for following up. I didn't work on this project any further, so I don't have much to add. But there's one thing I'm pretty sure of: running the program inside the US avoids many issues.
@gwenzek I wrote a post with tips on how to recreate this on GCP. Basically, using S3 or a Google Cloud Storage bucket and mounting it as a disk will save you a lot on storage fees.
Here's the log, run with the command nohup python -m cc_net mine --dump 2019-13 > 2019-13.log 2>2019-13.err &:

Is this just due to a poor network connection between me and the Amazon server (I'm in China)? If so, is it recommended to run the code from an AWS server located in the US? If I don't have a C++17 compiler, how much memory do I need? Thanks a lot.