Open we29758143 opened 3 years ago
How did you prepare it in the first place?
I am training a Chinese sentence-correction model. There are four text files (train.src, train.trg, valid.src, valid.trg), each containing multiple lines. I am using fairseq/fairseq_cli/preprocess.py to generate the binary files. Is there any parameter I can pass to it to reduce the size of the dictionary?
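For reference, a typical invocation looks something like this (paths and the data-bin directory are placeholders for your setup; --nwordssrc and --nwordstgt are the flags that cap the source and target dictionary sizes):

```shell
fairseq-preprocess \
    --source-lang src --target-lang trg \
    --trainpref data/train --validpref data/valid \
    --nwordssrc 45000 --nwordstgt 45000 \
    --workers 4 \
    --destdir data-bin
```

With --trainpref data/train and those language suffixes, fairseq-preprocess reads data/train.src and data/train.trg, which matches the file layout described above.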
It worked, but the dictionary size was just too big.
Thank you for answering.
I would suggest applying BPE (you can try sentencepiece, for example) to your text before preprocessing. This will let you set a --vocab_size flag that limits the size of your dictionary.
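A minimal sketch of that approach, assuming the sentencepiece command-line tools are installed (file names and the vocabulary size of 8000 are placeholders; --vocab_size is the flag referred to above, and a high --character_coverage is recommended for Chinese):

```shell
# Train a BPE model with a fixed vocabulary size on the raw source text
spm_train --input=data/train.src \
          --model_prefix=bpe \
          --vocab_size=8000 \
          --character_coverage=0.9995 \
          --model_type=bpe

# Encode the corpus into subword pieces before running fairseq-preprocess
spm_encode --model=bpe.model --output_format=piece \
           < data/train.src > data/train.bpe.src
```

After encoding all four files this way, the dictionary that fairseq-preprocess builds can contain at most the subword vocabulary you chose.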
Is there any argument here that would allow me to reduce the size of the dictionary?

fairseq-preprocess [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
    [--log-format {json,none,simple,tqdm}] [--tensorboard-logdir TENSORBOARD_LOGDIR]
    [--wandb-project WANDB_PROJECT] [--seed SEED] [--cpu] [--tpu] [--bf16]
    [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16]
    [--fp16-no-flatten-grads] [--fp16-init-scale FP16_INIT_SCALE]
    [--fp16-scale-window FP16_SCALE_WINDOW] [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
    [--min-loss-scale MIN_LOSS_SCALE] [--threshold-loss-scale THRESHOLD_LOSS_SCALE]
    [--user-dir USER_DIR] [--empty-cache-freq EMPTY_CACHE_FREQ]
    [--all-gather-list-size ALL_GATHER_LIST_SIZE] [--model-parallel-size MODEL_PARALLEL_SIZE]
    [--quantization-config-path QUANTIZATION_CONFIG_PATH] [--profile]
    [--criterion {sentence_prediction,ctc,adaptive_loss,label_smoothed_cross_entropy,composite_loss,nat_loss,masked_lm,sentence_ranking,legacy_masked_lm_loss,cross_entropy,wav2vec,label_smoothed_cross_entropy_with_alignment,vocab_parallel_cross_entropy}]
    [--tokenizer {nltk,space,moses}]
    [--bpe {gpt2,bytes,sentencepiece,subword_nmt,byte_bpe,characters,bert,fastbpe,hf_byte_bpe}]
    [--optimizer {adadelta,adam,adafactor,adagrad,lamb,nag,adamax,sgd}]
    [--lr-scheduler {cosine,reduce_lr_on_plateau,fixed,triangular,polynomial_decay,tri_stage,inverse_sqrt}]
    [--scoring {chrf,wer,sacrebleu,bleu}] [--task TASK] [-s SRC] [-t TARGET]
    [--trainpref FP] [--validpref FP] [--testpref FP] [--align-suffix FP]
    [--destdir DIR] [--thresholdtgt N] [--thresholdsrc N] [--tgtdict FP] [--srcdict FP]
    [--nwordstgt N] [--nwordssrc N] [--alignfile ALIGN] [--dataset-impl FORMAT]
    [--joined-dictionary] [--only-source] [--padding-factor N] [--workers N]
I set limits with --nwordstgt 45000 and --nwordssrc 13000, but both dictionaries come out the same size. Can you tell me where I went wrong?
How can I reduce the size of the dictionary? Is there any parameter I can set? If so, how do I choose a reasonable number if I don't know the original dictionary size?
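To get a ballpark for the original dictionary size, you can count the distinct whitespace-separated tokens in your training files before preprocessing. This is a rough stand-in for what fairseq-preprocess would build, and the file path in the comment is a placeholder:

```python
from collections import Counter

def vocab_stats(path):
    """Count distinct tokens and their frequencies in a tokenized text file."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    return counts

# Example usage (path is hypothetical):
# counts = vocab_stats("data/train.src")
# print("dictionary size:", len(counts))
# print("most frequent:", counts.most_common(10))
```

Running this over train.src and train.trg gives you the numbers to compare against --nwordssrc and --nwordstgt: capping the dictionary at N keeps the N most frequent tokens.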
Thanks in advance!!!