cordercorder / NMT

Neural Machine Translation
MIT License

How to go about using the repository from scratch #1

Open kr-sundaram opened 4 years ago

kr-sundaram commented 4 years ago

Thanks for making the repo public. I am new to machine translation and your repo seems promising to me. Could you please explain how to train and evaluate the model on my datasets?

cordercorder commented 4 years ago

There are some parameters to set before training, as the files multi_gpu_train.sh and train.sh show. If you want to train this model on a single machine with multiple GPUs, you can run the following command:

bash multi_gpu_train.sh

If you want to train on a single machine with a single GPU, please make sure your current working directory is $your_path/NMT/ and run the following command:

bash trainer/train.sh

Here is an example command for multi-GPU training, followed by a description of each parameter.

python -m torch.distributed.launch --nproc_per_node=3 multi_gpu_train.py \
    --device_id 1 2 3 \
    --src_language combine \
    --tgt_language en \
    --src_path /data/rrjin/corpus_data/lang_vec_data/bible-corpus/train_data/train_src_combine_bpe_32000.txt \
    --tgt_path /data/rrjin/corpus_data/lang_vec_data/bible-corpus/train_data/train_tgt_en_bpe_32000.txt \
    --src_vocab_path /data/rrjin/NMT/data/src_combine_32000.vocab \
    --tgt_vocab_path /data/rrjin/NMT/data/tgt_en_32000.vocab \
    --rnn_type lstm \
    --embedding_size 512 \
    --hidden_size 512 \
    --num_layers 3 \
    --checkpoint /data/rrjin/NMT/data/models/basic_multi_gpu_lstm \
    --batch_size 32 \
    --dropout 0.2 \
    --rebuild_vocab \
    --normalize

The parameters are:

- `--device_id`: IDs of the GPUs used for training
- `--src_language`: name of the source language
- `--tgt_language`: name of the target language
- `--src_path`: location of the source-language corpus
- `--tgt_path`: location of the target-language corpus
- `--src_vocab_path`: location where the source-language vocabulary is stored; the vocabulary is generated automatically from the corpus
- `--tgt_vocab_path`: location where the target-language vocabulary is stored; the vocabulary is generated automatically from the corpus
- `--rnn_type`: type of RNN used in the encoder and decoder; one of "rnn", "gru", or "lstm"
- `--embedding_size`: size of the word embeddings
- `--hidden_size`: hidden size of the RNN in the encoder
- `--num_layers`: number of RNN layers in the encoder and decoder; for example, if `num_layers` is 3, the encoder and the decoder each have 3 recurrent layers
- `--checkpoint`: prefix of the path where the trained model is saved
- `--batch_size`: number of sentences processed per training step
- `--dropout`: probability of an element being zeroed in the RNN
- `--rebuild_vocab`: build the vocabulary from the corpus
- `--normalize`: preprocess the sentences; refer to the `normalizeString` function in `NMT/utils/process.py` for more details
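A note on the corpus files (this is an assumption about the data format, not something stated above): `--src_path` and `--tgt_path` presumably point to a line-aligned parallel corpus, with one BPE-tokenized sentence per line and line i of the source file corresponding to line i of the target file. Under that assumption, a quick sanity check before training could look like this (a hypothetical helper, not part of the repo):

# Sanity check for the parallel corpus.
# Assumption: one sentence per line, and the files are line-aligned.
src_path = "/data/rrjin/corpus_data/lang_vec_data/bible-corpus/train_data/train_src_combine_bpe_32000.txt"
tgt_path = "/data/rrjin/corpus_data/lang_vec_data/bible-corpus/train_data/train_tgt_en_bpe_32000.txt"

with open(src_path, encoding="utf-8") as f_src, open(tgt_path, encoding="utf-8") as f_tgt:
    src_lines = sum(1 for _ in f_src)
    tgt_lines = sum(1 for _ in f_tgt)

if src_lines == tgt_lines:
    print(f"OK: {src_lines} sentence pairs")
else:
    print(f"Line count mismatch: {src_lines} source lines vs {tgt_lines} target lines")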

After training, you can run eval.py or quick_eval.py to translate (just run eval.sh or quick_eval.sh for simplicity). The difference between eval.py and quick_eval.py is that eval.py uses beam search when decoding, while quick_eval.py uses greedy decoding.
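For intuition about that difference, here is a minimal, self-contained sketch of greedy decoding versus beam search over a toy scoring function. The `step` function and all names below are illustrative stand-ins, not the repository's actual decoder API.

import math

BOS, EOS = 0, 1     # toy special tokens
VOCAB_SIZE = 5      # toy vocabulary size
MAX_LEN = 6

def step(prefix):
    """Toy stand-in for one decoder step: log-probabilities of the next token."""
    scores = [math.sin(len(prefix) * (tok + 1) + prefix[-1]) for tok in range(VOCAB_SIZE)]
    log_z = math.log(sum(math.exp(s) for s in scores))
    return [s - log_z for s in scores]  # log-softmax over the toy vocabulary

def greedy_decode():
    """quick_eval.py-style search: always extend with the single most likely token."""
    seq = [BOS]
    while len(seq) < MAX_LEN:
        log_probs = step(seq)
        nxt = max(range(VOCAB_SIZE), key=lambda tok: log_probs[tok])
        seq.append(nxt)
        if nxt == EOS:
            break
    return seq

def beam_search_decode(beam_size=3):
    """eval.py-style search: keep the beam_size best partial hypotheses at each step."""
    beams = [([BOS], 0.0)]          # (sequence, cumulative log-probability)
    finished = []
    for _ in range(MAX_LEN - 1):
        candidates = []
        for seq, score in beams:
            log_probs = step(seq)
            for tok in range(VOCAB_SIZE):
                candidates.append((seq + [tok], score + log_probs[tok]))
        candidates.sort(key=lambda cand: cand[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == EOS else beams).append((seq, score))
        if not beams:               # every surviving hypothesis has emitted EOS
            break
    finished.extend(beams)          # fall back to unfinished hypotheses if needed
    return max(finished, key=lambda cand: cand[1])[0]

print("greedy decoding:", greedy_decode())
print("beam search    :", beam_search_decode())

Beam search keeps beam_size hypotheses alive at every step and therefore costs roughly beam_size times more decoder work, but it usually recovers higher-scoring translations than greedy search; presumably that trade-off is why the greedy variant is named quick_eval.py.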

kr-sundaram commented 4 years ago

Thank you so much for your help! I will let you know if I need anything.