Hello OpenNMT-py Community,
I'm training a bilingual model with OpenNMT-py and have run into two issues involving custom SentencePiece models and pretrained embeddings. I'd appreciate guidance on how to resolve them.
Background:
I'm training a bilingual translation model with the Transformer architecture. Given the linguistic characteristics of the target language, I implemented custom tokenization rules by training my own SentencePiece models rather than relying on the defaults.
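For context, the custom models were trained roughly as follows; the vocabulary size, coverage, and user-defined symbols shown here are illustrative placeholders, not my exact values:

    spm_train --input=train.src.txt \
        --model_prefix=src_spm \
        --vocab_size=32000 \
        --character_coverage=1.0 \
        --model_type=unigram \
        --user_defined_symbols=<custom_tok_1>,<custom_tok_2>

(The target-side model tgt_spm was trained the same way on the target corpus.)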
Issues:
Despite following the documentation and setting up config.yaml to use the SentencePiece models (src_spm.model and tgt_spm.model) and their vocabularies (src_spm.vocab and tgt_spm.vocab), I hit the following error when running onmt_train:
    onmt_train: error: the following arguments are required: -src_vocab/--src_vocab
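For reference, here is the relevant (trimmed) portion of my config.yaml; the data paths are placeholders:

    # Apply the custom SentencePiece models on the fly
    src_subword_model: src_spm.model
    tgt_subword_model: tgt_spm.model

    data:
        corpus_1:
            path_src: data/train.src
            path_tgt: data/train.tgt
            transforms: [sentencepiece]
        valid:
            path_src: data/valid.src
            path_tgt: data/valid.tgt
            transforms: [sentencepiece]

Nothing in this excerpt sets src_vocab or tgt_vocab; I had assumed the .vocab files produced by SentencePiece would be picked up automatically, which is evidently not the case.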
After explicitly passing the -src_vocab and -tgt_vocab arguments, I ran into a second error, this time related to the pretrained embeddings:
    AssertionError: -save_data should be set if use pretrained embeddings.
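For completeness, this is the exact command I ran, pointing the vocab options directly at the raw SentencePiece .vocab files, which may itself be part of the problem if OpenNMT-py expects its own one-token-per-line vocabulary format:

    onmt_train -config config.yaml \
        -src_vocab src_spm.vocab \
        -tgt_vocab tgt_spm.vocab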
Configuration:
OpenNMT-py version: 3.5.1
Model architecture: Transformer
Tokenization: SentencePiece subword tokenization for both source and target languages
Pretrained embeddings for the source language (config excerpt below)
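The embeddings-related portion of the config currently looks like this; embeddings_type is my best guess for my vector file's format, and the commented-out save_data line is what the assertion seems to be asking for, though I'm unsure what it should point to:

    # Pretrained embeddings for the source side
    src_embeddings: embeddings/src_vectors.txt
    embeddings_type: GloVe        # or word2vec, depending on the file format
    # save_data: data/processed   # apparently required when embeddings are set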
Questions:
How can I resolve the errors encountered when using custom SentencePiece models and pretrained embeddings with OpenNMT-py?
Is there a recommended approach to integrating custom tokenization rules with OpenNMT-py's built-in SentencePiece functionality?
Are there specific configurations or steps required to use pretrained embeddings with SentencePiece tokenization in OpenNMT-py?
Any insights or suggestions from the community would be greatly appreciated.