aalto-speech / morfessor

Morfessor is a tool for unsupervised and semi-supervised morphological segmentation
http://morpho.aalto.fi
BSD 2-Clause "Simplified" License

Is the tokenizer.model deterministic? #23

Closed: somniumism closed this issue 2 years ago

somniumism commented 3 years ago

Hi, I'm developing a tokenizer for Korean. My project is to build a language model using SRILM's ngram, so the tokenizer plays a very important role. I haven't been able to run experiments yet because the corpus is very large, but I'd like an answer quickly, so I'm opening an issue.

Is the result of Morfessor deterministic? In other words, will the same model be produced if training is repeated dozens of times? If it is non-deterministic, are there any indices or methods for measuring how much the results (tokenizers) vary in performance?

svirpioj commented 2 years ago

This probably wasn't quick enough, sorry, but to give some answer:

The training algorithm goes through the input words in a random order, but by using the --randseed option you should be able to get the same output for the same input.
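As a rough illustration of why a fixed seed makes a random-order pass reproducible, here is a toy sketch in plain Python. It is not Morfessor's code: `visiting_order` is a hypothetical helper and the word list is made up. It only shows that a random number generator started from the same seed shuffles the same input into the same order.

```python
import random


def visiting_order(words, seed=None):
    """Return the words in a shuffled order (hypothetical helper, not Morfessor API).

    With a fixed seed the order is the same on every call;
    with seed=None it may differ from call to call.
    """
    rng = random.Random(seed)   # private generator, so global state is untouched
    order = list(words)         # copy, so the caller's list is not modified
    rng.shuffle(order)
    return order


# Made-up input words, for illustration only.
words = ["walk", "walks", "walked", "walking", "talk", "talks", "talked"]

run1 = visiting_order(words, seed=42)
run2 = visiting_order(words, seed=42)

print(run1 == run2)  # same seed, same input: same visiting order
```

The same reasoning applies only when the input itself is identical between runs; changing the corpus changes the result even with the same seed.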

The intrinsic measure for comparing different results is the cost (negative log-probability) that is logged during and after training; the lower, the better. Its correlation with performance in any actual application may be minimal, but at least sometimes better optimization leads to higher morphological segmentation accuracy.
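To make "negative log-probability, lower is better" concrete, here is a small sketch with made-up probabilities. It is an assumption-laden toy, not Morfessor's actual cost function (the real cost is computed by the model over its lexicon and the training corpus); `negative_log_prob` and the two candidate lists are hypothetical.

```python
import math


def negative_log_prob(probabilities):
    """Sum of -log(p) over independent events (hypothetical helper).

    A more probable outcome gives a smaller value, which is why
    a lower cost is preferred.
    """
    return -sum(math.log(p) for p in probabilities)


# Made-up per-unit probabilities for two candidate results.
result_a = [0.5, 0.25]   # more probable overall
result_b = [0.1, 0.05]   # less probable overall

cost_a = negative_log_prob(result_a)
cost_b = negative_log_prob(result_b)

# Pick the candidate with the lowest cost.
best = min(("a", cost_a), ("b", cost_b), key=lambda pair: pair[1])[0]
print(best, cost_a < cost_b)
```

When comparing runs trained with different seeds on the same data, the logged final costs can be compared in the same way.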