Tools to download and clean Common Crawl as introduced in our paper CCNet.
If you found these resources useful, please consider citing:
@inproceedings{wenzek2020ccnet,
title={CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data},
author={Wenzek, Guillaume and Lachaux, Marie-Anne and Conneau, Alexis and Chaudhary, Vishrav and Guzm{\'a}n, Francisco and Joulin, Armand and Grave, {\'E}douard},
booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
pages={4003--4012},
year={2020}
}
We only tried this on Linux, but installation should be possible on macOS too.
Create a data folder, or symlink one, at the location where you want to download the corpus.
Run make install. This will download some resources and install the required packages.
If you have a C++17 compiler you can also run pip install .[getpy], which provides a more memory-efficient hashset.
Install the following tools manually if make install failed:
- lmplz and build_binary from KenLM
- spm_train and spm_encode from SentencePiece

The Makefile is used to train SentencePiece and LM on Wikipedia data.
- make help: shows help
- make lang=de lm: trains a SentencePiece model and a LM on German Wikipedia
- make all_lm: trains the same models as in the paper
- make lang=de dl_lm: downloads the LM trained for the paper
- make dl_all_lm: downloads all of them

The full mining pipeline is divided into 3 steps:
- hashes: downloads one Common Crawl snapshot and computes hashes for each paragraph
- mine: removes duplicates, detects language, runs the LM, and splits by language/perplexity buckets
- regroup: regroups the files created by mine into chunks of 4 GB

Each step needs the previous step to finish before starting.
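The paragraph-level deduplication performed by the hashes and mine steps can be illustrated with a minimal sketch. This is not the cc_net implementation: the normalization (lowercasing, collapsing whitespace), the truncation of the SHA-1 digest to 8 bytes, and the function names paragraph_hash and dedup_paragraphs are all assumptions chosen for illustration. The sketch also hashes and filters in a single pass, whereas the pipeline runs them as separate steps.

```python
import hashlib
from typing import Iterable, List, Set


def paragraph_hash(paragraph: str) -> bytes:
    """Hash a paragraph after light normalization (assumed: lowercase, collapsed whitespace)."""
    normalized = " ".join(paragraph.lower().split())
    # Keeping only the first 8 bytes of the digest is an assumption made to save memory.
    return hashlib.sha1(normalized.encode("utf-8")).digest()[:8]


def dedup_paragraphs(documents: Iterable[str], seen: Set[bytes]) -> List[str]:
    """Drop paragraphs whose hash was already seen; drop documents left empty."""
    kept_docs = []
    for doc in documents:
        kept = []
        for para in doc.split("\n"):
            h = paragraph_hash(para)
            if h in seen:
                continue  # duplicate of an earlier paragraph
            seen.add(h)
            kept.append(para)
        if kept:
            kept_docs.append("\n".join(kept))
    return kept_docs


docs = ["Cookie notice\nFirst article.", "cookie  NOTICE\nSecond article."]
print(dedup_paragraphs(docs, set()))
```

In this toy input the second "cookie notice" paragraph normalizes to the same text as the first, so only "Second article." survives from the second document.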
You can launch the full pipeline using python -m cc_net.
- python -m cc_net --help: shows help
- python -m cc_net --dump 2019-13: processes a specific snapshot
- python -m cc_net -l my -l gu: restricts to specific languages
- python -m cc_net --lm_dir my_lms/: uses custom LMs
- python -m cc_net --lang_threshold 0.3: sets a specific field in mine.Config
- python -m cc_net --config test: runs on a tiny subset of a snapshot
- python -m cc_net --config config/my_config.json: uses the configuration from the given config file

Given the CPU time required to run the full pipeline on such a big corpus, we share a mapping from URL to the information we computed. You can reconstruct the corpus used in the paper by using:
python -m cc_net --conf reproduce --dump 2019-09
The XLM-RoBERTa model from the paper Unsupervised Cross-lingual Representation Learning at Scale was trained on data extracted by an internal version of cc_net.
Because the format is slightly different, please use the following commands instead:
python cc_net/tools/dl_cc_100.py --help
python cc_net/tools/dl_cc_100.py --outdir data_cc100 --process 8
If you use this version of the data please also consider citing:
@article{conneau2019unsupervised,
title={Unsupervised Cross-lingual Representation Learning at Scale},
author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
journal={arXiv preprint arXiv:1911.02116},
year={2019}
}
Given the computation cost of running the full pipeline, we distributed the computation
on a Slurm cluster using submitit.
submitit will default to spawning processes on your machine if no Slurm cluster is found.
You should tweak --task_parallelism to something adapted to your machine.
Defaults are 512 for mining and 20 for reproducing.
To run the tasks in-process use --execution debug.
Generated files are compressed JSON files. There is one JSON object per line.
List of fields:
Sample JSON object:
{
"url": "http://www.pikespeakhospice.org/members/1420",
"date_download": "2019-02-15T18:40:25Z",
"digest": "sha1:VQW3KXUOALO543IJGTK2JLVEAN2XXKHI",
"length": 752,
"nlines": 5,
"source_domain": "www.pikespeakhospice.org",
"title": "LeeRoy Aragon",
"raw_content": "Date Honored: March 2017\nHe was a man of integrity, a hard worker, and a dedicated family man. He loved spending time with family camping, fishing, hunting, boating and just hanging out.\nHis Catholic faith was extremely important to him as he gave of his time and talents to the community. He had many friends through church and the Knights of Columbus. He was a meticulous handyman, and enjoyed building and fixing things and restoring antique furniture to perfection. He was a fan and supported his Colorado Rockies and Denver Broncos. Throughout the years he had devoted four-legged friends (his dogs and a horse named Sunny Boy).\nWe have many cherished memories of him that we will treasure until we are with him again.\n~ Family of LeeRoy F. Aragon",
"original_nlines": 7,
"original_length": 754,
"language": "en",
"language_score": 0.99,
"perplexity": 255.11
}
You can peek at those files using the UNIX tools zcat and jq, e.g.:
zcat data/mined/2019-09/en_head_0000.json.gz | head -1 | jq .
jq can do some complicated filtering.
jsonql.py provides a Python API with multiprocess support to do more complicated operations like LM scoring of the document.
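For quick inspection, the Python standard library is enough to read the generated files. The sketch below does not use jsonql.py: the helper names read_documents and filter_documents, the thresholds, and the example.com URLs are hypothetical, and only the field names (url, language_score, perplexity) come from the sample JSON object above. It writes a tiny temporary file so that it runs without any mined data.

```python
import gzip
import json
import os
import tempfile
from typing import Iterator


def read_documents(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a gzip-compressed JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def filter_documents(path: str, min_language_score: float = 0.9,
                     max_perplexity: float = 300.0) -> Iterator[dict]:
    """Keep documents with a high language score and a low perplexity.

    The default thresholds are illustrative, not values used by cc_net.
    """
    for doc in read_documents(path):
        if (doc["language_score"] >= min_language_score
                and doc["perplexity"] <= max_perplexity):
            yield doc


# Demo on a tiny temporary file with made-up documents.
sample = [
    {"url": "http://example.com/a", "language_score": 0.99, "perplexity": 255.11},
    {"url": "http://example.com/b", "language_score": 0.45, "perplexity": 1200.0},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for doc in sample:
            f.write(json.dumps(doc) + "\n")
    kept = list(filter_documents(path))

print([doc["url"] for doc in kept])
```

To run it on real output, pass a path such as data/mined/2019-09/en_head_0000.json.gz to filter_documents instead of the temporary file.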
By contributing to cc_net, you agree that your contributions will be licensed
under the LICENSE file in the root directory of this source tree.