OpenNMT / OpenNMT-py

Open Source Neural Machine Translation and (Large) Language Models in PyTorch
https://opennmt.net/
MIT License
6.67k stars 2.24k forks

Can't get past SentencePiece subword tokenization with pretrained embeddings #2581

Closed HURIMOZ closed 2 months ago

HURIMOZ commented 2 months ago

Hi, I'm building a bilingual translation model (Transformer) with SentencePiece subword tokenization for both the source and target data, and with pretrained subword embeddings for the source data.

The system is not happy with a simple command like `onmt_train -config config.yaml`: it throws `onmt_train: error: the following arguments are required: -src_vocab/--src_vocab`, even though the config file correctly points at the vocab files generated by SentencePiece.

So I tried the command `onmt_train -config config.yaml -src_vocab data/src_spm.vocab -tgt_vocab data/tgt_spm.vocab -gpu_ranks 0`, but then I get this error: `AssertionError: -save_data should be set if use pretrained embeddings`.
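For reference, a minimal config sketch of the relevant keys (key names as in the OpenNMT-py v3 docs; the training-file and embedding-file names below are hypothetical, the other paths are the ones from the commands above). Setting `save_data` in the config should address the assertion error:

```yaml
# Sketch only, not a complete training config.
save_data: data/processed          # required when pretrained embeddings are used

src_vocab: data/src_spm.vocab      # must be in OpenNMT-py's vocab format (see below)
tgt_vocab: data/tgt_spm.vocab

data:
    corpus_1:
        path_src: data/src-train.txt   # hypothetical file names
        path_tgt: data/tgt-train.txt
        transforms: [sentencepiece]

src_subword_model: data/src_spm.model
tgt_subword_model: data/tgt_spm.model

# pretrained source-side embeddings (file name hypothetical)
embeddings_type: "word2vec"
src_embeddings: data/src_embeddings.txt
```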

So I tried to rebuild the vocabularies with the onmt_build_vocab module: `onmt_build_vocab -config config.yaml -n_sample 80000 -save_data data/processed -src_vocab data/src_spm.vocab -tgt_vocab data/tgt_spm.vocab`, but again I got stuck, this time with: `raise IOError(f"path {path} exists, stop.")` / `OSError: path data/src_spm.vocab exists, stop.`

Given that I have already trained my SentencePiece models and vocabs, I should not need to run onmt_build_vocab to create separate vocabulary files again. The SentencePiece models (`src_spm.model` and `tgt_spm.model`) and their corresponding vocabularies (`src_spm.vocab` and `tgt_spm.vocab`) should suffice for training, right?
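One possible culprit (an assumption on my part, not confirmed in this thread): SentencePiece's `.vocab` files store `token<TAB>log-probability`, whereas OpenNMT-py expects its own vocab format (one token per line, optionally with a tab-separated count). If that mismatch is the problem, a small conversion script like this sketch would produce a vocab file OpenNMT-py can consume without re-running onmt_build_vocab:

```python
def convert_spm_vocab(spm_vocab_path: str, out_path: str) -> None:
    """Convert a SentencePiece .vocab file (token<TAB>log-prob per line)
    into a plain token-per-line file for OpenNMT-py's src_vocab/tgt_vocab.

    Assumption: OpenNMT-py adds its own special tokens, so SentencePiece's
    reserved symbols are skipped here to avoid duplicates.
    """
    with open(spm_vocab_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            # Keep only the token; drop the SentencePiece score column.
            token = line.rstrip("\n").split("\t")[0]
            if token in ("<unk>", "<s>", "</s>"):
                continue
            fout.write(token + "\n")
```

You would run this once per side (e.g. on `data/src_spm.vocab` and `data/tgt_spm.vocab`, writing to new files) and point `src_vocab`/`tgt_vocab` at the converted files, which also sidesteps the "path exists, stop." error since nothing is overwritten.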

Any help welcome! Thank you, Tamatoa