facebookresearch / cc_net

Tools to download and cleanup Common Crawl data
MIT License
972 stars 142 forks

The final json files are not as expected #44

Open nengyinyibeiwu opened 1 year ago

nengyinyibeiwu commented 1 year ago

python3 -m cc_net --config config/test_segment.json

The run finishes with:

Regrouped test_data3/mined_by_lang/2019-09/en_head_0000.json.gz (1 / 3)
Regrouped test_data3/mined_by_lang/2019-09/en_tail_0000.json.gz (2 / 3)
Regrouped test_data3/mined_by_lang/2019-09/en_middle_0000.json.gz (3 / 3)

But the JSON files do not contain cleaned-up documents. They contain records like these:

{"url": "http://3stepbreath.com/shikhandin.html", "digest": "sha1:TNOFSVGSL4OE4F3JZKAMAAXW2VA5KORA", "cc_segment": "crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/wet/CC-MAIN-20190215183319-20190215205319-00000.warc.wet.gz", "language": "en", "language_score": 1.0, "perplexity": 296.6, "bucket": "head", "line_ids": "JwAoACkAKgArAA=="}
{"url": "http://911forum.org.uk/board/viewtopic.php?p=175455", "digest": "sha1:HTWRWQKQPGOAPRU3KXF6XWXUIFJIE2GE", "cc_segment": "crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/wet/CC-MAIN-20190215183319-20190215205319-00000.warc.wet.gz", "language": "en", "language_score": 0.96, "perplexity": 266.5, "bucket": "head", "line_ids": "AAABAAIAAwAIAAsADAANAA4ADwAQABEAEgATABQAFQAWABcAGAAZABoAGwAdACQAJQAmACcAKAApACoAKwAsAC0ALgAvADAAMQAyADwAPQA+AEcATABNAE4ATwBTAFcAWABZAFoAWwBcAF0AXgBfAGAAYgBjAGQAZQBmAGcAaABpAGoAawBsAG0AbgBvAHAAcQByAHMAdAB1AHYAdwB4AHoAewB8AH0AfgB/AIAAgQCCAIMAhACFAIYAhwCIAIkAigCLAJIAkwCUAJUAlgCXAJgAmQCaAJsAnACdAJ4AnwCgAKEAogC8AL0AvgC/AMAAwQDCAMMAxADFAMYAxwDIAMkAygDLAMwAzQDOAM8A0ADRANIA0wDUANUA1gDXANgA2QDaANsA3ADdAN4A3wDoAOkA6gDrAOwA9QD7APwA/QD+AAcBCAEJAQoBCwEMAQ0BEgEVAQ=="}
{"url": "http://965kvki.com/no-more-saturday-postal-service/", "digest": "sha1:M2WX6RZ3E3KISXFCB7GXCMU432RKIYWO", "cc_segment": "crawl-data/CC-MAIN-2019-09/segments/1550247479101.30/wet/CC-MAIN-20190215183319-20190215205319-00000.warc.wet.gz", "language": "en", "language_score": 0.94, "perplexity": 251.9, "bucket": "head", "line_ids": "VABVAFgAWgBbAFwA"}

Why? I just want a clean corpus.
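For anyone inspecting records like the ones above: each one carries metadata (`url`, `digest`, `cc_segment`, `language`, `perplexity`, `bucket`) and a `line_ids` string, but no text field. The `line_ids` values look like base64. Below is a minimal sketch for decoding them, assuming they are packed little-endian unsigned 16-bit integers. That layout is a guess based on the sample values, not something confirmed in this thread, and `decode_line_ids` is a hypothetical helper name, not part of cc_net.

```python
import base64
import struct


def decode_line_ids(encoded: str) -> list[int]:
    """Decode a base64 string into a list of integers.

    Assumption: the payload is a sequence of little-endian unsigned
    16-bit integers (2 bytes per value).
    """
    raw = base64.b64decode(encoded)
    count = len(raw) // 2  # number of 16-bit values in the payload
    return list(struct.unpack(f"<{count}H", raw[: count * 2]))


# Values taken from the first and third records shown above.
print(decode_line_ids("JwAoACkAKgArAA=="))
print(decode_line_ids("VABVAFgAWgBbAFwA"))
```

Under that assumption the first record decodes to a short run of consecutive small integers, which would be consistent with the field listing which lines of the original document were kept, although that interpretation is also an assumption.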