OpenNMT / OpenNMT-py

Open Source Neural Machine Translation and (Large) Language Models in PyTorch
https://opennmt.net/
MIT License

Issues with Custom SentencePiece Models and Pretrained Embeddings in Training #2582

Closed · HURIMOZ closed this issue 2 months ago

HURIMOZ commented 2 months ago

Hello OpenNMT-py community,

I've been working on training a bilingual model with OpenNMT-py and have run into some challenges using custom SentencePiece models and pretrained embeddings. I'm seeking guidance or suggestions on how to resolve them.

Background: I'm training a bilingual translation model with the transformer architecture. Given the linguistic characteristics of the target language, I initially attempted to implement custom tokenization rules using SentencePiece.

Issue: Despite following the documentation and setting up the config.yaml file to use the SentencePiece models (`src_spm.model` and `tgt_spm.model`) and vocabularies (`src_spm.vocab` and `tgt_spm.vocab`), I hit the following error when running the `onmt_train` command:

```
onmt_train: error: the following arguments are required: -src_vocab/--src_vocab
```

After explicitly specifying the `-src_vocab` and `-tgt_vocab` arguments, I encountered a second error, this time related to the use of pretrained embeddings:

```
AssertionError: -save_data should be set if use pretrained embeddings.
```
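For context, here is a sketch of how I currently understand the relevant options fitting together, based on the two errors and my reading of the 3.x docs. The file paths are placeholders from my setup, and I'm not certain every option name or value is right, which is partly why I'm asking:

```yaml
# config.yaml (excerpt) -- paths are placeholders from my setup

# Vocab files appear to be required even when a SentencePiece transform is used
src_vocab: data/src_spm.vocab
tgt_vocab: data/tgt_spm.vocab

# The AssertionError suggests this must be set whenever pretrained
# embeddings are enabled
save_data: data/run

# SentencePiece subword tokenization on both sides
transforms: [sentencepiece]
src_subword_model: data/src_spm.model
tgt_subword_model: data/tgt_spm.model

# Pretrained embeddings for the source language only
src_embeddings: data/src_embeddings.txt
embeddings_type: word2vec   # or GloVe, depending on the file format
word_vec_size: 512
```

With this config the training command starts, but I'm unsure whether the SentencePiece `.vocab` files are accepted as-is for `src_vocab`/`tgt_vocab` or need to be converted first.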

Configuration:

- OpenNMT-py version: 3.5.1
- Model architecture: Transformer
- Tokenization: SentencePiece subword tokenization for both source and target languages
- Pretrained embeddings for the source language

Questions:

Are there specific configurations or steps required to use pretrained embeddings with SentencePiece tokenization in OpenNMT-py?

Any insights or suggestions from the community would be greatly appreciated.