facebookresearch / cc_net

Tools to download and cleanup Common Crawl data
MIT License
972 stars, 142 forks

Can reproduce still run normally? #51

Closed: newbietuan closed this issue 1 year ago

newbietuan commented 1 year ago

Hi there. Can the reproduce config still run normally? When I run it, the output is:

Will run cc_net.mine.main with the following config: Config(config_name='reproduce', dump='2019-09', output_dir=PosixPath('data'), mined_dir='reproduce', execution='auto', num_shards=1600, min_shard=-1, num_segments_per_shard=-1, metadata='https://dl.fbaipublicfiles.com/cc_net/1.0.0', min_len=300, hash_in_mem=50, lang_whitelist=['zh'], lang_blacklist=[], lang_threshold=0.5, keep_bucket=['head', 'all'], lm_dir=PosixPath('data/lm_sp'), cutoff=PosixPath('/home/yutuan.ma/RedPajama-Data-main/data_prep/cc/cc_net/cc_net/data/cutoff.csv'), lm_languages=None, mine_num_processes=1, target_size='4G', cleanup_after_regroup=False, task_parallelism=-1, pipeline=['fetch_metadata', 'keep_lang', 'keep_bucket', 'split_by_lang'], experiments=[], cache_dir=None)
Submitting 1600 jobs for _mine_shard, with task_parallelism=64
Waiting on 64 running jobs. Job ids: 43439,43506,43573,43640...

After that, nothing happens. The behavior is the same when I change lang_whitelist to 'en'.