I read in the issues that we should keep the dictionary at ~50k. Why should this be a limit? Initially, when I ran with a combined dict size of ~600k, everything ran properly even with 1024 tokens. Now it is 3.6 million ... is that a problem? Can multi-GPU solve this, or do I need a larger GPU to train this (like a 24 or 32 GB VRAM GPU)? Can someone give me suggestions? Thanks.
Do you mean that you have a vocabulary size of 3.6 million? That would be a problem. How did you preprocess your data? Did you use BPE to segment the data?
Yes, I did use BPE, and the dict sizes are still as follows:
| [mr] dictionary: 2691064 types
| [en] dictionary: 922896 types
The problem is that the moses and nltk tokenizers can't break words in Devanagari script well; they work for Roman scripts. The BERT tokenizer is another option, or I could go character-level: the model size would be much smaller, but I'm not sure how much training time would be required or what accuracy it would reach. Anyway, using BPE with moses still gives 922896 dictionary entries for English, which is also huge. Do you recommend anything else?
Ahhh, got it. I used only the `--tokenizer` option, not `--bpe`, which has sentencepiece, subword_nmt, fastbpe, and gpt2 options to choose from. Which one should I go for? I will try it and report back here on the model sizes in my case. Thanks @bricksdont for pointing this out.
All options are fine. But if I remember correctly, GPT produces a byte-level vocabulary, which is quite different from the other options.
I would recommend using sentencepiece; it might even work without pre-tokenization.
Vocabulary size can be quite low, 32k is a reasonable value to try first.
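For example, a minimal sketch of learning a joint 32k vocabulary directly on the raw training data (file names are placeholders; this assumes the sentencepiece command-line tools are installed):

```bash
# Learn a single joint subword vocabulary over both sides of the parallel data.
# train.mr / train.en are placeholder paths for the raw (untokenized) training files.
spm_train \
  --input=train.mr,train.en \
  --model_prefix=spm_joint \
  --vocab_size=32000 \
  --character_coverage=1.0 \
  --model_type=unigram
```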
I found these strange results. I used the bert-tokenizer from huggingface, then ran preprocess without the bpe option, and got the following dict sizes:
| [mr] Dictionary: 22967 types
| [mr] /data/translation_models/marathi_english_translation/mr_en_token/raw_data/train.mr: 2024847 sents, 95037672 tokens, 0.0% replaced by <unk>
| [en] Dictionary: 28167 types
| [en] /data/translation_models/marathi_english_translation/mr_en_token/raw_data/train.en: 2024847 sents, 51040154 tokens, 0.0% replaced by <unk>
Now my model has ~50 million params and takes 7-8 GB of GPU memory. Good enough to work with.
Then I used sentencepiece with moses, and subword-nmt with moses, and got the following dict sizes:
| [mr] Dictionary: 2691063 types
| [mr] /data/translation_models/marathi_english_translation/mr_en_new/raw_data/train.mr: 2024847 sents, 32508211 tokens, 0.0% replaced by <unk>
| [en] Dictionary: 922895 types
| [en] /data/translation_models/marathi_english_translation/mr_en_new/raw_data/train.en: 2024847 sents, 36592253 tokens, 0.0% replaced by <unk>
With this, the model has ~2 billion params and takes >15 GB of GPU memory. Not sure why the dict size didn't decrease in the case of sentencepiece or subword-nmt!
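For context, with vocabularies this large the embedding tables alone dominate the parameter count: assuming an embedding dimension of 512, (2,691,063 + 922,895) × 512 ≈ 1.85 billion parameters before counting any encoder or decoder layers, which roughly matches the ~2 billion figure above.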
Can someone comment on this? It seems like a bug!
Can you share more details about how you ran sentencepiece or subword-nmt?
This script should be a useful reference for learning a vocab with sentencepiece: https://github.com/pytorch/fairseq/blob/master/examples/translation/prepare-iwslt17-multilingual.sh#L100-L126
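Roughly, the learned sentencepiece model then has to be applied to the training files before binarizing, otherwise the preprocessing step builds a dictionary over the raw words. A minimal sketch (placeholder paths; assumes the sentencepiece CLI and fairseq are installed):

```bash
# Segment both sides with the learned model, then binarize with fairseq.
# spm_joint.model and the train.* paths are placeholders.
for lang in mr en; do
  spm_encode --model=spm_joint.model --output_format=piece \
    < train.$lang > train.spm.$lang
done

fairseq-preprocess \
  --source-lang mr --target-lang en \
  --trainpref train.spm \
  --destdir data-bin/mr_en_spm \
  --joined-dictionary \
  --workers 8
```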
I am using wordpiece for now and it's working well. I will take a look at the above script soon and update. Closing the issue for now. Thanks, buddy.
@myleott @allhelllooz @bricksdont @ynd
Hello. I am currently using Fairseq for grammatical error correction with translation models. I could run translation models like fconv and transformer on Google Colab, but when I switched over to my gaming laptop, I encountered some problems training the models with fairseq.
Currently, when I run this script:

```bash
python train.py data-bin/lang-8-fairseq2 --save-dir checkpoints/lang-8-fairseq-transformer2 \
  --arch transformer_iwslt_de_en --share-decoder-input-output-embed \
  --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
  --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
  --dropout 0.3 --weight-decay 0.0001 --max-tokens 4096 --fp16
```
It gives me this error.
I do not know why it is overflowing or how to get it to run. Can somebody help me with this? Any help would be appreciated. Thanks in advance.
I am getting the following error on a p3.2xlarge machine with a single 16 GB GPU. It ran properly when I had ~400k sentence pairs; I added more data, now ~2000k sentence pairs, and it's not running. I trimmed out sentences longer than 64 tokens and set `--max-tokens` to 64 as well, and it's still not working. What is going on with the `already allocated` and `cached` memory it reports?

I am running the following command:

Error logs:

Following are other initial logs: