Unbabel / OpenKiwi

Open-Source Machine Translation Quality Estimation in PyTorch
https://unbabel.github.io/OpenKiwi/
GNU Affero General Public License v3.0
229 stars, 48 forks

Vocabulary size for predictor training #64

Status: Closed (ghost closed this issue 4 years ago)

ghost commented 4 years ago

Hi,

First of all, thank you for OpenKiwi.

My question: when training the Predictor, what exactly do source-vocab-size and target-vocab-size mean: the total number of tokens in the training corpora, or the number of unique words?
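For context, in most NMT-style toolkits a "vocabulary size" refers to the number of unique token types kept (the most frequent ones), not the total token count of the corpus. The sketch below illustrates that convention with a hypothetical build_vocab helper; it is an assumption about typical behavior, not OpenKiwi's actual implementation, and the function name and special tokens are invented for illustration.

```python
from collections import Counter

def build_vocab(corpus_lines, vocab_size, specials=("<unk>", "<pad>")):
    """Build a frequency-capped vocabulary (illustrative sketch).

    vocab_size counts unique token *types* kept in the vocabulary,
    including special tokens -- not the total number of tokens
    appearing in the corpus.
    """
    counts = Counter()
    for line in corpus_lines:
        counts.update(line.split())
    # Keep the most frequent types, reserving slots for special tokens.
    n_keep = max(vocab_size - len(specials), 0)
    kept = [tok for tok, _ in counts.most_common(n_keep)]
    vocab = list(specials) + kept
    return {tok: idx for idx, tok in enumerate(vocab)}

corpus = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocab(corpus, vocab_size=5)
# The corpus has 9 tokens total and 6 unique types,
# but the cap keeps only 5 vocabulary entries.
print(len(vocab))  # 5
```

Under this convention, rare words that fall outside the cap would map to an unknown token at training time.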

Perhaps adding this to the relevant docs page would help less experienced users of frameworks of this kind.

Best, Andras (TAUS)