facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

How to reduce the size of the dictionary? #3296

Open · we29758143 opened this issue 3 years ago

we29758143 commented 3 years ago

How can I reduce the size of the dictionary? Is there a parameter I can set for this? If so, how do I choose a reasonable number when I don't know the original dictionary size?

Thanks in advance!!!

lematt1991 commented 3 years ago

How did you prepare it in the first place?

we29758143 commented 3 years ago

I am training a Chinese sentence-correction model, so there are four text files (train.trg, train.src, valid.trg, valid.src), each containing multiple lines. I am using fairseq/fairseq_cli/preprocess.py to generate the binary files. Is there any parameter I can pass to it to reduce the size of the dictionary?

It worked, but the dictionary size was just too big.

Thank you for answering.

lematt1991 commented 3 years ago

I would suggest applying BPE (you can try sentencepiece for example) to your text before preprocessing. This will allow you to set a --vocab_size flag that will limit the size of your dictionary.
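For reference, a minimal sketch of that workflow with the sentencepiece command-line tools, reusing the train/valid file names from earlier in this thread; the model prefix spm_bpe, the 16000 vocabulary size, and the 0.9995 character coverage are illustrative choices, not values from this issue:

```bash
# Train a single BPE model over both sides of the parallel data.
# --vocab_size directly caps how many subword pieces (and hence
# dictionary entries) the model can produce.
spm_train --input=train.src,train.trg \
          --model_prefix=spm_bpe \
          --vocab_size=16000 \
          --character_coverage=0.9995 \
          --model_type=bpe

# Apply the trained model to every split before running fairseq-preprocess.
for f in train.src train.trg valid.src valid.trg; do
    spm_encode --model=spm_bpe.model --output_format=piece < "$f" > "$f.bpe"
done
```

After this step, fairseq-preprocess builds its dictionary over the BPE pieces rather than raw tokens, so the dictionary size is bounded by roughly the chosen vocab_size.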

we29758143 commented 3 years ago

Is there any argument right here that can allow me to reduce the size of the dictionary?

fairseq-preprocess [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL]
    [--log-format {json,none,simple,tqdm}] [--tensorboard-logdir TENSORBOARD_LOGDIR]
    [--wandb-project WANDB_PROJECT] [--seed SEED] [--cpu] [--tpu] [--bf16]
    [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16] [--fp16-no-flatten-grads]
    [--fp16-init-scale FP16_INIT_SCALE] [--fp16-scale-window FP16_SCALE_WINDOW]
    [--fp16-scale-tolerance FP16_SCALE_TOLERANCE] [--min-loss-scale MIN_LOSS_SCALE]
    [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--user-dir USER_DIR]
    [--empty-cache-freq EMPTY_CACHE_FREQ] [--all-gather-list-size ALL_GATHER_LIST_SIZE]
    [--model-parallel-size MODEL_PARALLEL_SIZE] [--quantization-config-path QUANTIZATION_CONFIG_PATH]
    [--profile]
    [--criterion {sentence_prediction,ctc,adaptive_loss,label_smoothed_cross_entropy,composite_loss,nat_loss,masked_lm,sentence_ranking,legacy_masked_lm_loss,cross_entropy,wav2vec,label_smoothed_cross_entropy_with_alignment,vocab_parallel_cross_entropy}]
    [--tokenizer {nltk,space,moses}]
    [--bpe {gpt2,bytes,sentencepiece,subword_nmt,byte_bpe,characters,bert,fastbpe,hf_byte_bpe}]
    [--optimizer {adadelta,adam,adafactor,adagrad,lamb,nag,adamax,sgd}]
    [--lr-scheduler {cosine,reduce_lr_on_plateau,fixed,triangular,polynomial_decay,tri_stage,inverse_sqrt}]
    [--scoring {chrf,wer,sacrebleu,bleu}] [--task TASK] [-s SRC] [-t TARGET]
    [--trainpref FP] [--validpref FP] [--testpref FP] [--align-suffix FP]
    [--destdir DIR] [--thresholdtgt N] [--thresholdsrc N] [--tgtdict FP] [--srcdict FP]
    [--nwordstgt N] [--nwordssrc N] [--alignfile ALIGN] [--dataset-impl FORMAT]
    [--joined-dictionary] [--only-source] [--padding-factor N] [--workers N]

I set limits with --nwordstgt 45000 and --nwordssrc 13000, but both of them produce the same dictionary size. Can you tell me where I went wrong?
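For comparison, a minimal sketch of how those flags are normally passed, reusing the train/valid prefixes from earlier in the thread; the output directory data-bin/zh-corrector is just a placeholder. Note that --nwordssrc/--nwordstgt only take effect when fairseq-preprocess builds the dictionaries itself; they have no effect if existing dictionaries are supplied via --srcdict/--tgtdict.

```bash
# Build binarized data with capped source/target dictionary sizes.
fairseq-preprocess \
    --source-lang src --target-lang trg \
    --trainpref train --validpref valid \
    --nwordssrc 13000 --nwordstgt 45000 \
    --destdir data-bin/zh-corrector \
    --workers 4
```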

stale[bot] commented 3 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!