ZhiYuanZeng / fairseq-moe


preprocessing problem #1

Open zhengkid opened 1 year ago

zhengkid commented 1 year ago

I cannot find `CorpusManager.py` in this project. How can I get it?

ZhiYuanZeng commented 1 year ago

The preprocessing pipeline for the opus-100 dataset in this project follows https://github.com/cordercorder/nmt-multi. Sorry, I forgot to mention it in the README; I will add that soon.

zhengkid commented 1 year ago

> The preprocessing pipeline for the opus-100 dataset in this project follows https://github.com/cordercorder/nmt-multi. Sorry, I forgot to mention it in the README; I will add that soon.

Thanks, I will try it.

ZhiYuanZeng commented 1 year ago

Feel free to ask questions if you run into any problems while reimplementing.

zhengkid commented 1 year ago

Hi, I still cannot run this code. I use the following training script to train the model:

```shell
python3 -u train.py data-bin/$data_dir \
  --distributed-world-size $gpu_num \
  --arch $arch \
  --optimizer adam --clip-norm 0.0 \
  --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates $warmup \
  --lr $lr --min-lr 1e-09 \
  --weight-decay $weight_decay \
  --criterion $criterion --label-smoothing 0.1 \
  --max-tokens $max_tokens \
  --update-freq $update_freq \
  --no-progress-bar \
  --log-interval 100 \
  --ddp-backend no_c10d \
  --sampling-method "temperature" \
  --sampling-temperature 1.5 \
  --encoder-langtok src \
  --decoder-langtok \
  --seed 1 \
  --task translation_multi_simple_epoch \
  --lang-pairs en-fr,cy-en,hu-en,en-lt,en-mg,yi-en,as-en,en-mr,uz-en,eo-en,li-en,es-en,ka-en,am-en,en-he,en-ja,nb-en,en-ku,en-cs,en-fi,si-en,en-no,en-se,az-en,en-ga,da-en,en-vi,eu-en,en-pa,ca-en,id-en,en-eu,cs-en,kn-en,te-en,en-ug,en-be,rw-en,gu-en,en-cy,en-tt,en-am,xh-en,en-nb,sv-en,sq-en,en-nn,en-bn,ha-en,en-hu,en-pl,en-ko,en-tg,en-zu,en-nl,ps-en,af-en,be-en,ga-en,mg-en,en-mt,bs-en,or-en,bn-en,en-sr,tg-en,hi-en,fr-en,se-en,en-hr,en-eo,en-de,en-it,sk-en,tt-en,is-en,km-en,en-br,nn-en,vi-en,en-ka,ne-en,en-et,ro-en,en-ha,fa-en,oc-en,en-sh,ko-en,en-yi,en-fa,it-en,no-en,en-ig,en-af,en-da,en-th,ur-en,en-pt,zu-en,ja-en,zh-en,ar-en,en-ky,fi-en,en-mk,lv-en,my-en,en-kk,ta-en,en-ca,mt-en,fy-en,en-uk,th-en,el-en,ml-en,et-en,en-my,en-es,en-sv,wa-en,en-sk,en-ro,en-oc,bg-en,en-uz,tr-en,sl-en,sh-en,de-en,en-lv,en-is,en-km,mr-en,en-hi,pa-en,en-gu,hr-en,en-tk,en-ta,pl-en,en-kn,lt-en,en-ps,ug-en,en-bg,br-en,en-ru,en-sl,en-ne,en-te,en-bs,tk-en,gl-en,en-si,en-rw,sr-en,pt-en,en-tr,ky-en,en-gd,ku-en,en-id,en-ur,en-li,uk-en,en-or,en-sq,gd-en,en-ar,en-ml,kk-en,en-el,en-zh,en-gl,en-as,ig-en,ms-en,nl-en,en-fy,en-az,he-en,en-ms,ru-en,mk-en,en-wa,en-xh \
  --lang-dict data-bin/$data_dir/lang_list.txt \
  --save-dir $save_dir \
  --keep-last-epochs $keep_last_epochs \
  --tensorboard-logdir $save_dir
```

but I get the following error:

```
Traceback (most recent call last):
  File "train.py", line 14, in <module>
    cli_main()
  File "/home/v-lbei/fairseq-msra/fairseq_cli/train.py", line 357, in cli_main
    distributed_utils.call_main(args, main)
  File "/home/v-lbei/fairseq-msra/fairseq/distributed_utils.py", line 283, in call_main
    torch.multiprocessing.spawn(
  File "/home/v-lbei/miniconda3/envs/fairseq/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/v-lbei/miniconda3/envs/fairseq/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/v-lbei/miniconda3/envs/fairseq/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 7 terminated with the following error:
Traceback (most recent call last):
  File "/home/v-lbei/miniconda3/envs/fairseq/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/v-lbei/fairseq-msra/fairseq/distributed_utils.py", line 270, in distributed_main
    main(args, **kwargs)
  File "/home/v-lbei/fairseq-msra/fairseq_cli/train.py", line 65, in main
    task.load_dataset(valid_sub_split, combine=False, epoch=1)
  File "/home/v-lbei/fairseq-msra/fairseq/tasks/translation_multi_simple_epoch.py", line 146, in load_dataset
    self.datasets[split] = self.data_manager.load_sampled_multi_epoch_dataset(
  File "/home/v-lbei/fairseq-msra/fairseq/data/multilingual/multilingual_data_manager.py", line 1023, in load_sampled_multi_epoch_dataset
    datasets, data_param_list = self.load_split_datasets(
  File "/home/v-lbei/fairseq-msra/fairseq/data/multilingual/multilingual_data_manager.py", line 988, in load_split_datasets
    data_param_list = self.get_split_data_param_list(
  File "/home/v-lbei/fairseq-msra/fairseq/data/multilingual/multilingual_data_manager.py", line 902, in get_split_data_param_list
    paths, epoch, shard_epoch, split_num_shards_dict[key]
KeyError: 'main:en-fr'
```

Can you help figure it out?

ZhiYuanZeng commented 1 year ago

Sorry for the delayed response. Have you tried adding `--source-dict` and `--target-dict` to the arguments? You can follow this training script: https://github.com/ZhiYuanZeng/fairseq-moe/blob/main/train_scripts/scomoe/train_base_model_on_opus100.sh.
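If it helps, the added flags would look roughly like this. This is only a sketch: the dictionary file name (`dict.txt` here) is a placeholder that depends on how the data was binarized, so check the linked script for the exact paths it uses.

```shell
# Hypothetical excerpt: both flags point at the shared multilingual dictionary.
# "dict.txt" is a placeholder name; use the dictionary produced by your preprocessing.
python3 -u train.py data-bin/$data_dir \
  --task translation_multi_simple_epoch \
  --source-dict data-bin/$data_dir/dict.txt \
  --target-dict data-bin/$data_dir/dict.txt \
  # ... remaining arguments unchanged from your original command
```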

Another issue: if you want to train an MoE model with fairseq, you need to set `--ddp-backend` to `fully_sharded` or `legacy_ddp`.
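Concretely, that means changing one flag in the command above (a sketch; all other arguments stay as they are):

```shell
# Replace this line from the original command:
#   --ddp-backend no_c10d
# with one of the backends supported for MoE training:
  --ddp-backend fully_sharded
# or:
  --ddp-backend legacy_ddp
```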