Btw, there are some errors/typos in the README: the dump_id parameter is not found, so I changed it to dump.
Thanks for opening the bug.
I'm not sure what produces "connection reset by peer". In my experience it only happened when I was downloading CC from ~500 processes, so I think it was just some kind of AWS rate limit. If you look at the code, I sleep and retry 3 times when it happens. You can try increasing this setting by modifying the code.
But it looks like you were running it in a single process, so I'm not sure.
I did all of my runs from a datacenter based in California; let me know if using a datacenter in the USA helps. The "Estimated remaining time" from your logs is also quite high: 9 hours, while it was under one hour for me. Your connection may be the issue.
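For reference, the retry behaviour I'm referring to is roughly of this shape (a minimal sketch, not the actual cc_net code; the function name and sleep schedule are illustrative, only the idea of an n_retry knob carries over):

```python
# Illustrative retry loop for transient network errors such as
# "connection reset by peer" (ConnectionResetError is a subclass of OSError).
# Bumping n_retry is the kind of change meant by "increase this setting".
import time
import urllib.request


def download_with_retry(url: str, n_retry: int = 3) -> bytes:
    for attempt in range(n_retry):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except OSError as e:
            if attempt == n_retry - 1:
                raise  # give up after the last attempt
            wait = 10 * (attempt + 1)
            print(f"{e!r} while fetching {url}, retrying in {wait}s ({attempt + 1}/{n_retry})")
            time.sleep(wait)
    raise AssertionError("unreachable")
```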
If you intend to run the full pipeline (python -m cc_net mine), then I suggest you get gcc 7 and compile the binary for getpy. Currently it needs (from memory; more details in the paper) 75 GB of RAM per process, with 40 GB being the hash set. If you use a standard Python dict it will use around 100 GB just for the hash set.
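To give an intuition for that gap, here is a rough pure-Python illustration (not the cc_net code; the real getpy structure stores more per entry than a bare array, so the actual gap is smaller than this toy comparison suggests):

```python
# Rough illustration of per-hash memory cost: flat uint64 storage vs a Python set.
# Numbers are approximate and only meant to show the order-of-magnitude difference.
import sys
import numpy as np

n = 1_000_000
hashes = np.random.default_rng(0).integers(0, 2**63, size=n, dtype=np.uint64)

flat_bytes = hashes.nbytes                    # 8 bytes per 64-bit hash
py_set = set(hashes.tolist())
set_bytes = sys.getsizeof(py_set) + 32 * n    # pointer/hash table + ~32 bytes per boxed int

print(f"flat array : {flat_bytes / n:5.1f} bytes/hash")
print(f"python set : {set_bytes / n:5.1f} bytes/hash (approximate)")
```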
I also suggest you look into distributing the workload. I don't have much experience with AWS, so I'm interested in learning if there are things I should change to make the code easier to run on AWS.
If you just want the corpus with the text, what you need is python -m cc_net reproduce --dump 2019-09, which doesn't require the big hash sets.
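For what it's worth, each output file is gzipped JSON lines (one document per line), so you can inspect a shard with something like this (the path below is just an example):

```python
# Peek at a few documents from one reproduced shard.
# The file name is only an example; adjust it to whatever reproduce produced.
import gzip
import json

path = "data/reconstruct/2019-09/ar_head_0001.json.gz"
with gzip.open(path, "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        doc = json.loads(line)
        print(sorted(doc.keys()))      # fields such as url, language, raw_content, ...
        print(doc.get("url"))
        if i >= 2:
            break
```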
It's very kind of you! I guess this was due to a poor network connection, so I switched to AWS.
I don't want to deal with the C++17 issue, so I abandoned the full pipeline and decided to run the reproduction procedure with python -m cc_net reproduce --dump 2019-09, as you suggested.
For some complicated reasons, I set up my AWS server in Singapore instead of the US. The server instance has 8 CPUs, 64 GB of memory and a 4 TB hard disk. Is this configuration good enough to finish the reproduction? If so, how long will it take in your opinion?
After ~20 hours of running, I have the following in the reconstruct folder:
ubuntu@<my-hostname>:/data/cc_net/data/reconstruct/2019-09$ ll
total 169704
drwxr-xr-x 3 root root 4096 Nov 13 02:27 ./
drwxr-xr-x 3 root root 4096 Nov 12 10:50 ../
-rw-r--r-- 1 root root 8078541 Nov 12 12:46 ar_head_0001.json.gz
-rw-r--r-- 1 root root 1701686 Nov 13 07:27 tmp.af_head_0000.json.gz
-rw-r--r-- 1 root root 50175486 Nov 13 07:28 tmp.ar_head_0000.json.gz
-rw-r--r-- 1 root root 136 Nov 12 16:03 tmp.ar_head_0000.json.gz.index
-rw-r--r-- 1 root root 136 Nov 12 12:46 tmp.ar_head_0001.json.gz.index
-rw-r--r-- 1 root root 2804071 Nov 13 07:28 tmp.az_head_0000.json.gz
-rw-r--r-- 1 root root 2688051 Nov 13 07:28 tmp.be_head_0000.json.gz
-rw-r--r-- 1 root root 22772810 Nov 13 07:28 tmp.bg_head_0000.json.gz
-rw-r--r-- 1 root root 136 Nov 13 02:27 tmp.bg_head_0000.json.gz.index
-rw-r--r-- 1 root root 13951603 Nov 13 07:28 tmp.bn_head_0000.json.gz
-rw-r--r-- 1 root root 11034433 Nov 13 07:28 tmp.ca_head_0000.json.gz
-rw-r--r-- 1 root root 136 Nov 12 11:52 tmp.ca_head_0000.json.gz.index
-rw-r--r-- 1 root root 60328129 Nov 13 07:28 tmp.cs_head_0000.json.gz
-rw-r--r-- 1 root root 152 Nov 12 20:35 tmp.cs_head_0000.json.gz.index
drwxr-xr-x 2 root root 155648 Nov 13 07:28 wet_cache/
and 1585 files like CC-MAIN-20190224044113-20190224070113-00599.warc.wet in wet_cache, which consume ~200 GB of disk space.
Is this normal? There are only ~10 *_head_*.json.gz files and the program seems to have stopped downloading and moved on to the corpus cleaning, but IIRC the paper says it'll generate a 3.2 TB corpus covering ~100 languages. How could it generate such a large clean corpus from a much smaller raw corpus?
Another issue I have to mention: occasionally the program fails (sorry, I can't find the error log right now), so I added a crontab task to detect the failure and restart it with the original command python -m cc_net reproduce --dump 2019-09. Is it OK to simply restart it? Will it continue the previous computation or start from scratch?
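The crontab task is essentially doing this (sketched here as a plain Python loop rather than the actual cron entry; timings are simplified):

```python
# Watchdog sketch: re-run the reproduce command whenever the previous run
# exits with an error, and stop once it finishes cleanly.
import subprocess
import time

CMD = ["python", "-m", "cc_net", "reproduce", "--dump", "2019-09"]

while True:
    returncode = subprocess.run(CMD).returncode
    if returncode == 0:
        break  # finished cleanly
    print(f"cc_net exited with code {returncode}, restarting in 60s")
    time.sleep(60)
```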
Looking forward to your reply!
the server instance has 8 CPUs, 64 GB of memory and a 4 TB hard disk
This seems enough, but the 8 cores will make it slow. I'd say 4 days if everything goes fine; using 160 CPUs it runs in under 4 hours. The first thing you can do is prioritize languages. Do you want all languages or just a subset?
The tmp. files are quite small: 50 MB for Arabic, while it should be around 4 GB.
So I think the process stopped very early.
For the crashing, I'd bet on out-of-memory. Can you try running with --parallelism=1 and restricting to one language?
If you have logs, can you share lines saying something like:
2019-11-13 07:08 INFO 72609:Unminifier - Processed 8 documents in 0.016h ( 0.1 doc/s).
2019-11-13 07:08 INFO 72609:Unminifier - Read 263_924, stocking 5 doc in 0.1Gb.
I observed that the memory usage can climb a bit high for languages with a lot of documents because I'm doing some kind of buffering. If this is the issue, I'll look into reducing the RAM usage.
Restarting will drop all existing tmp files, but all files without the tmp prefix will be kept. Also, all documents in wet_cache are kept, so you won't need to download them again.
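Concretely, the reason restarting is safe is the usual write-to-tmp-then-rename pattern (a sketch of the idea, not the exact cc_net code): a shard only gets its final name once it is complete, so an interrupted run can only leave tmp files behind, and completed shards can be skipped on restart.

```python
# Sketch of the tmp-then-rename pattern behind the restart behaviour.
import gzip
from pathlib import Path
from typing import Iterable


def write_shard(final: Path, lines: Iterable[str]) -> None:
    tmp = final.with_name("tmp." + final.name)   # e.g. tmp.ar_head_0000.json.gz
    with gzip.open(tmp, "wt", encoding="utf-8") as o:
        for line in lines:
            o.write(line + "\n")
    tmp.rename(final)                            # only complete shards get their final name


def already_done(final: Path) -> bool:
    return final.exists()                        # on restart, completed shards are skipped
```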
The wet_cache, which holds the original corpus, is made of *.gz files, while the numbers reported in the paper are for uncompressed data.
Update:
As you mentioned above, this program heavily uses CPUs, so now I'm working with a 40-core server with 64 GB of memory. Because it is located in China, I changed n_retry to 30 to let it recover from the unstable network connection. Waiting for good news.
The following section continues the previous post, but I have since abandoned this approach. I'll keep it as a reference for others who want to reproduce this work.
=======================================================
Thanks for your feedback.
Do you want all languages or just a subset?
Yes, I want the whole corpus including all languages.
I checked /var/log/syslog and found no memory issues or kill signals, so the operating system didn't kill it. Finally, I was able to find the error in the program's log:
2019-11-13 22:11 INFO 9578:Unminifier - ! Missed 14 documents (0.1%) !
2019-11-13 22:11 INFO 9578:Unminifier - ! Missed 120 paragraphs (0.0%) !
2019-11-13 22:11 INFO 9572:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247481428.19/wet/CC-MAIN-20190217010854-20190217032854-00303.warc.wet.gz [200]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/data/cc_net/cc_net/execution.py", line 145, in global_fn
return f(*args[1:])
File "/data/cc_net/cc_net/minify.py", line 289, in unminify_file
jsonql.run_pipes(unminifier, file=iter(mini), output=tmp)
File "/data/cc_net/cc_net/jsonql.py", line 448, in run_pipes
for res in results:
File "/data/cc_net/cc_net/jsonql.py", line 292, in map
yield self(x)
File "/data/cc_net/cc_net/jsonql.py", line 259, in __call__
y = self.do(x)
File "/data/cc_net/cc_net/jsonql.py", line 358, in do
x = t(x)
File "/data/cc_net/cc_net/jsonql.py", line 259, in __call__
y = self.do(x)
File "/data/cc_net/cc_net/minify.py", line 193, in do
full_doc = self.retrieve_doc(segment, digest)
File "/data/cc_net/cc_net/minify.py", line 178, in retrieve_doc
with self.open_segment(segment) as f:
File "/data/cc_net/cc_net/minify.py", line 159, in open_segment
tmp.unlink()
File "/usr/lib/python3.7/pathlib.py", line 1294, in unlink
self._accessor.unlink(self)
FileNotFoundError: [Errno 2] No such file or directory: '/data/cc_net/data/reconstruct/2019-09/wet_cache/tmp_34c08f87.CC-MAIN-20190215183319-20190215205319-00491.warc.wet.gz'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/data/cc_net/cc_net/__main__.py", line 31, in <module>
main()
File "/data/cc_net/cc_net/__main__.py", line 27, in main
command(**parsed_args)
File "/data/cc_net/cc_net/minify.py", line 368, in reproduce
unminify(urls, output_dir / dump, execution, parallelism, cache_dir)
File "/data/cc_net/cc_net/minify.py", line 329, in unminify
ex(unminify_file, files, outputs, itertools.repeat(cache_dir))
File "/data/cc_net/cc_net/execution.py", line 174, in __call__
global_fn, zip(itertools.repeat(f_name), *args)
File "/usr/lib/python3.7/multiprocessing/pool.py", line 748, in next
raise value
FileNotFoundError: [Errno 2] No such file or directory: '/data/cc_net/data/reconstruct/2019-09/wet_cache/tmp_34c08f87.CC-MAIN-20190215183319-20190215205319-00491.warc.wet.gz'
Here's my current memory footprint:
ubuntu@<my-hostname>:/data/cc_net$ top
top - 03:19:36 up 1 day, 20:00, 1 user, load average: 8.00, 8.00, 8.00
Tasks: 174 total, 9 running, 165 sleeping, 0 stopped, 0 zombie
%Cpu(s): 99.2 us, 0.7 sy, 0.0 ni, 0.0 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 65332224 total, 356440 free, 20029108 used, 44946676 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 44719176 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14522 root 20 0 632756 113200 5368 R 100.0 0.2 306:00.58 python
14523 root 20 0 6807276 5.995g 5436 R 100.0 9.6 306:45.40 python
14526 root 20 0 2641508 2.023g 5432 R 100.0 3.2 307:10.08 python
14525 root 20 0 728648 209500 5776 R 99.7 0.3 305:58.95 python
14528 root 20 0 2168852 1.572g 5432 R 99.7 2.5 307:02.03 python
14529 root 20 0 7868240 7.006g 5436 R 99.7 11.2 306:36.43 python
14527 root 20 0 2272288 1.671g 5432 R 99.3 2.7 307:14.14 python
14524 root 20 0 993484 473840 5432 R 99.0 0.7 306:18.47 python
63 root 20 0 0 0 0 S 0.3 0.0 4:27.24 kswapd0
1 root 20 0 37964 5232 3228 S 0.0 0.0 0:05.33 systemd
Here's some of the log right before the crash, as you requested:
2019-11-13 22:10 INFO 9575:cc_net.process_wet_file - Kept 43_893 documents over 44_765 (98.1%).
2019-11-13 22:10 INFO 9575:Unminifier - Processed 2_689 documents in 1.2e+01h ( 0.1 doc/s).
2019-11-13 22:10 INFO 9575:Unminifier - Read 69_062_621, stocking 1_094 doc in 0.5Gb.
2019-11-13 22:10 INFO 9575:Unminifier - ! Missed 11 documents (0.4%) !
2019-11-13 22:10 INFO 9575:Unminifier - ! Missed 274 paragraphs (0.1%) !
2019-11-13 22:10 INFO 9575:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247479627.17/wet/CC-MAIN-20190215224408-20190216010408-00088.warc.wet.gz
2019-11-13 22:10 INFO 9573:cc_net.process_wet_file - Kept 43_502 documents over 44_286 (98.2%).
2019-11-13 22:10 INFO 9573:Unminifier - Processed 90_076 documents in 1.2e+01h ( 2.1 doc/s).
2019-11-13 22:10 INFO 9573:Unminifier - Read 84_485_662, stocking 54_562 doc in 7e+00Gb.
2019-11-13 22:10 INFO 9573:Unminifier - ! Missed 21 documents (0.0%) !
2019-11-13 22:10 INFO 9573:Unminifier - ! Missed 291 paragraphs (0.0%) !
2019-11-13 22:10 INFO 9573:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/wet/CC-MAIN-20190215183319-20190215205319-00491.warc.wet.gz
2019-11-13 22:10 INFO 9574:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/wet/CC-MAIN-20190215183319-20190215205319-00491.warc.wet.gz [200]
2019-11-13 22:11 INFO 9579:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/wet/CC-MAIN-20190215183319-20190215205319-00491.warc.wet.gz [200]
2019-11-13 22:11 INFO 9576:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/wet/CC-MAIN-20190215183319-20190215205319-00491.warc.wet.gz [200]
2019-11-13 22:11 INFO 9578:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/wet/CC-MAIN-20190215183319-20190215205319-00491.warc.wet.gz [200]
2019-11-13 22:11 INFO 9577:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/wet/CC-MAIN-20190215183319-20190215205319-00491.warc.wet.gz [200]
2019-11-13 22:11 INFO 9578:JsonReader - Processed 23_367 documents in 1.2e+01h ( 0.6 doc/s).
2019-11-13 22:11 INFO 9578:Unminifier - Processed 23_366 documents in 1.2e+01h ( 0.6 doc/s).
2019-11-13 22:11 INFO 9578:Unminifier - Read 84_485_662, stocking 14_523 doc in 2e+00Gb.
2019-11-13 22:11 INFO 9578:Unminifier - ! Missed 14 documents (0.1%) !
2019-11-13 22:11 INFO 9578:Unminifier - ! Missed 120 paragraphs (0.0%) !
2019-11-13 22:11 INFO 9572:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-09/segments/1550247481428.19/wet/CC-MAIN-20190217010854-20190217032854-00303.warc.wet.gz [200]
(followed by the same FileNotFoundError traceback as shown above)
So I guess the memory footprint is fine, but did I miss some raw corpus during the downloading process? Is this related to restarting?
Now I have ~3k files in the wet_cache folder:
ubuntu@<my-hostname>:/data/cc_net$ ll data/reconstruct/2019-09/wet_cache | wc -l
3068
Is it correct? I remember the paper says each dump is divided into 1600 shards.
File "/data/cc_net/cc_net/minify.py", line 178, in retrieve_doc
with self.open_segment(segment) as f:
File "/data/cc_net/cc_net/minify.py", line 159, in open_segment
tmp.unlink()
File "/usr/lib/python3.7/pathlib.py", line 1294, in unlink
self._accessor.unlink(self)
FileNotFoundError: [Errno 2] No such file or directory: '/data/cc_net/data/reconstruct/2019-09/wet_cache/tmp_34c08f87.CC-MAIN-20190215183319-20190215205319-00491.warc.wet.gz'
I think you caught a subtle race condition. I've pushed 07e66aa6 to fix this. Sorry for the bug. Can you try with this commit?
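The failure mode is several workers touching the same cached segment at once, so any cleanup of a temporary file has to tolerate it already being gone. Here is a sketch of the kind of defensive handling involved (illustrative only, not the actual content of 07e66aa6):

```python
# Race-tolerant caching sketch: per-worker temp names, atomic rename,
# and cleanup that never assumes the temp file still exists.
import os
import uuid
from pathlib import Path


def cache_segment(cache_dir: Path, name: str, data: bytes) -> Path:
    final = cache_dir / name
    if not final.exists():                       # another worker may have cached it already
        tmp = cache_dir / f"tmp_{uuid.uuid4().hex[:8]}.{name}"
        tmp.write_bytes(data)
        os.replace(tmp, final)                   # atomic rename; last writer wins
    return final


def safe_unlink(path: Path) -> None:
    try:
        path.unlink()
    except FileNotFoundError:                    # another worker (or a rename) got there first
        pass
```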
Now I have ~3k files in the wet_cache folder. Is it correct?
CommonCrawl is released in 64k "segments" (numbers may vary from one dump to another). In the paper we group those segments by 40 to get 1600 "shards" of around 4 GB; this makes computation more efficient since we have big models to load for each process.
The corpus we released is also split into shards of 4 GB, but those are grouped by language and quality, not by their original segment.
So wet_cache should have 64k files at the end.
Wow, lucky me! It's actually not that rare: I encounter it 3 to 4 times a day. I'll try the new code.
I'm still dealing with some network issues (latency, bandwidth, etc.), so no progress yet. Btw, I checked the CommonCrawl website and found the total size of the wet files in the 2019-09 dump is 7.62 TB. If the program caches all of them on the local disk, a 4 TB hard disk seems inadequate. So what's the minimum hard disk requirement? Maybe something like 15 TB (7.6 TB raw wet files + 3.2 TB clean corpus + some safety margin)?
@soloice I also faced a similar problem; my solution is to use gcsfuse (the AWS equivalent would be s3fs) to mount remote cloud storage for all the wet files, while keeping the tmp files and clean corpus on a regular mounted disk.
My current setup mounts a Google Cloud Storage bucket as the wet_cache. I think you can do the same with AWS S3, mounting it with s3fs. I also recommend downloading the wet files from a server within the Common Crawl region (the US, I suppose?), as the download speed is much faster, which reduces processes sitting idle waiting for downloads to finish.
Have you faced any memory errors when processing large corpora such as English or German? 64 GB of memory doesn't seem enough to me. I ended up adding a large swap and had to restart my whole process again.
@theblackcat102 Hi, I didn't encounter any memory issues. This might be due to the short continuous running time (I have to restart my process every 4-12 hours because of network connection, file opening, and other errors).
I agree with you that the best practice is to run the program (both computation and storage) within the US.
Reproducing this work is just a side project for me (I'm helping a data analyst colleague with it), and I have something more important to do now, so I won't be spending more time on it for the next month or so. I'll probably come back to it after that.
Did you manage to run the code in the end? If you have some more tips you can share with future users of the repository, that would be great, and we could add them to the README.
Thanks for following up. I didn't work on this project any further, so I don't have much to add. But there's one thing I'm pretty sure of: running the program inside the US avoids many issues.
@gwenzek I wrote a post with tips on how to recreate this on GCP. Basically, using S3 or a Google Cloud Storage bucket and mounting it as a disk will save you a lot on storage fees.
Here's the log, run with the command nohup python -m cc_net mine --dump 2019-13 > 2019-13.log 2>2019-13.err &:

Is this just due to a poor network connection between me and the Amazon server (I'm in China)? If so, is it recommended to run the code from an AWS server located in the US? If I don't have a C++17 compiler, how much memory do I need? Thanks a lot.