Is the tokenizer.model deterministic?

aalto-speech / morfessor

Morfessor is a tool for unsupervised and semi-supervised morphological segmentation

BSD 2-Clause "Simplified" License


This probably wasn't quick enough, sorry, but to add an answer:

The training algorithm goes through the input words in a random order, but by using the --randseed option you should be able to get the same output for the same input.
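To illustrate why a fixed seed makes the run repeatable, here is a minimal sketch in plain Python. It does not use Morfessor itself: `training_order` and the word list are hypothetical stand-ins, and the assumption is only that the random word order comes from a seedable pseudo-random generator.

```python
import random

def training_order(words, seed=None):
    """Return the words in the order a seeded RNG would visit them.

    Hypothetical helper for illustration; not part of Morfessor.
    """
    rng = random.Random(seed)  # private generator, unaffected by global state
    order = list(words)        # copy so the caller's list is left untouched
    rng.shuffle(order)
    return order

words = ["walk", "walked", "walking", "talk", "talked", "talking"]

run_a = training_order(words, seed=42)
run_b = training_order(words, seed=42)

# Same seed and same input give the same visiting order.
print(run_a == run_b)  # True
```

With `seed=None` the generator is seeded from the system, so two runs will usually visit the words in different orders, which is why results can differ when no seed is given.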

The intrinsic measure for comparing different results is the cost (negative log-probability) that is logged during and after training: the lower, the better. Its correlation with performance in an actual application may be minimal, but better optimization does at least sometimes lead to higher morphological segmentation accuracy.
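As a toy sketch of what "lower cost is better" means, the snippet below computes a negative log-probability for two made-up runs and picks the lower one. The probabilities and run names are invented for illustration, and the real cost Morfessor logs includes more than this simple sum, so treat it only as a picture of the comparison.

```python
import math

def cost(probabilities):
    """Negative log-probability of a sequence of independent events."""
    return -sum(math.log(p) for p in probabilities)

# Hypothetical per-morph probabilities from two training runs.
runs = {
    "seed_1": [0.5, 0.25, 0.25],
    "seed_2": [0.5, 0.5, 0.25],
}

costs = {name: cost(ps) for name, ps in runs.items()}
best = min(costs, key=costs.get)  # the run with the lowest cost

for name, value in costs.items():
    print(f"{name}: cost = {value:.4f}")
print("lowest cost:", best)
```

Here `seed_2` assigns the data a higher probability, so its cost is lower and it would be the preferred run by this intrinsic measure.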

Is the tokenizer.model deterministic? #23