facebookresearch / cc_net

Tools to download and cleanup Common Crawl data

Early exit when desired number of documents is reached? #5

Closed. JohnGiorgi closed this issue 4 years ago.

JohnGiorgi commented 4 years ago

Apologies if this is mentioned somewhere or is otherwise obvious, but:

Is there a way to early-exit when a desired number of documents has been collected? Say I only want 1 million documents; can I somehow exit the call to `python cc_net mine` once I have hit that number?

Thanks a lot in advance.

gwenzek commented 4 years ago

Hi John, this is not implemented as such, but you can do something similar.

For the mining there is no way to specify a number of documents, but you can choose the number of "segments" to use. A segment is 64 MB of compressed text, and there are around 64k of them in each CC snapshot (the number varies).

In your config.json you can set:

```json
"num_shards": 1,
"num_segments_per_shard": 40
```

This will process only a small fraction of the CC dataset. Note that having volume helps the deduplication step remove boilerplate, which impacts the language identification as well as the language modelling, so I wouldn't recommend this except for testing.

The number of documents you will get also depends on other parameters, namely which languages you want and the LM score thresholds you're using.
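For reference, a minimal config restricting the run to one language might look like the sketch below. Treat the field names as my best recollection of the mining config (`dump`, `lang_whitelist`, `output_dir`); double-check them against the `Config` in mine.py and the example configs in the repo.

```json
{
  "dump": "2019-09",
  "num_shards": 1,
  "num_segments_per_shard": 40,
  "lang_whitelist": ["en"],
  "output_dir": "data"
}
```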

If you're not interested in running the full mining pipeline, you can call reproduce with a specific shard id {lang}_{bucket}_{id} (e.g. en_head_0000), which would be faster. You can also hack run_pipes to return early; see the sketch below.
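To make the return-early idea concrete, here is a minimal sketch; none of this is cc_net's own API. `limit_docs` is a hypothetical generator that could be spliced into the pipeline (an assumption about how run_pipes composes iterable-based steps), and `take_first_docs` simply truncates an already-mined shard, relying only on the fact that output shards are gzipped JSON-lines files. The path in the usage example is illustrative.

```python
import gzip
import itertools
import json


def limit_docs(docs, limit=1_000_000):
    """Cap an iterable of documents at `limit` items.

    A generator like this could in principle be spliced into a pipeline
    of iterable-to-iterable steps to stop early; treat that as an
    assumption about your checkout, not documented cc_net behavior.
    """
    return itertools.islice(docs, limit)


def take_first_docs(path, limit=1_000_000):
    """Post-hoc alternative: read at most `limit` documents from a mined
    shard. Output shards are gzipped JSON lines (one document per line)."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in itertools.islice(f, limit):
            yield json.loads(line)


# Example usage with an illustrative path; the actual layout depends on
# your config (dump, language, bucket, shard id).
if __name__ == "__main__":
    for doc in take_first_docs("mined/2019-09/en_head_0000.json.gz", limit=10):
        print(doc.get("url"))
```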

I should probably add a limit parameter somewhere though.

JohnGiorgi commented 4 years ago

Okay, so it sounds like you recommend the second option? If I use reproduce with a specific shard ID, will I get a collection of text that is grouped in some way (e.g. by topic or otherwise)? I want to avoid training on a dataset that contains too many similar texts.

I will try this for now but do let me know if you add the limit param! I think that would be useful for anyone who wants a big, unlabelled dataset but doesn’t quite need all of CC.

gwenzek commented 4 years ago

> will I get a collection of text that is grouped in some way

The text will be grouped by language and "perplexity". The perplexity is biased toward Wikipedia-like content, so you'll be missing some topics, like very specialized forums (but mostly porn).
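For intuition, here is a minimal sketch of how the head/middle/tail bucketing works: documents are scored with a language model trained on Wikipedia, and per-language perplexity cutoffs split them into buckets. The function and argument names below are illustrative, not cc_net's actual API.

```python
def perplexity_bucket(perplexity: float, head_cutoff: float, tail_cutoff: float) -> str:
    """Assign a document to a quality bucket.

    Lower perplexity means the text looks more like the Wikipedia data
    the LM was trained on. Cutoffs are chosen per language; the names
    here are illustrative.
    """
    if perplexity < head_cutoff:
        return "head"    # most Wikipedia-like
    if perplexity < tail_cutoff:
        return "middle"
    return "tail"        # least Wikipedia-like, or very specialized
```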