Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models
MIT License

How does marian use vocab.yaml? #5

Closed. sshleifer closed this issue 4 years ago.

sshleifer commented 4 years ago

Is it like this? I'm having trouble understanding the C++ code.

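import sentencepiece
import yaml

text = 'What is for dinner ?'
vocab_file = 'en-de/opus.spm32k-spm32k.vocab.yml'

# BaseLoader reads every scalar as a string, so pieces such as 'true' or '1'
# stay strings; the IDs therefore need an explicit int() conversion.
with open(vocab_file, encoding='utf-8') as f:
    vocab = {piece: int(idx) for piece, idx in yaml.load(f, Loader=yaml.BaseLoader).items()}

# Split the text into subword pieces with the source-side SentencePiece model.
spm_source = sentencepiece.SentencePieceProcessor()
spm_source.Load('en-de/source.spm')
pieces = spm_source.encode_as_pieces(text)

# Look up each piece in the vocabulary (raises KeyError for a piece not in the vocab).
ids = [vocab[p] for p in pieces]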
sshleifer commented 4 years ago

Resolved. I think Marian just looks up each piece produced by SentencePiece in the vocab file to get its ID.
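For anyone who wants to try the lookup without the model files, here is a minimal sketch of that step on its own. The toy vocabulary, the piece list, and the `pieces_to_ids` helper are all made up for illustration and are not taken from the OPUS-MT files. Falling back to an `<unk>` entry for pieces missing from the vocab is an assumption about Marian's behavior, not something confirmed in this thread.

```python
from typing import Dict, List


def pieces_to_ids(pieces: List[str], vocab: Dict[str, int],
                  unk_token: str = "<unk>") -> List[int]:
    """Map each subword piece to its integer ID with a plain dictionary lookup.

    Assumption: pieces missing from the vocab map to the ID of `unk_token`.
    """
    unk_id = vocab[unk_token]
    return [vocab.get(piece, unk_id) for piece in pieces]


# Hypothetical stand-in for the parsed vocab.yml (piece -> integer ID).
toy_vocab = {"</s>": 0, "<unk>": 1, "▁What": 2, "▁is": 3,
             "▁for": 4, "▁dinner": 5, "▁?": 6}

# Hypothetical pieces, shaped like the output of encode_as_pieces.
toy_pieces = ["▁What", "▁is", "▁for", "▁dinner", "▁?"]

ids = pieces_to_ids(toy_pieces, toy_vocab)
print(ids)
```

With the real files, `toy_vocab` would be the dictionary loaded from the vocab YAML and `toy_pieces` the output of the SentencePiece model.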