somniumism closed this issue 2 years ago
This probably wasn't quick enough, sorry, but to add an answer:
The training algorithm goes through the input words in a random order, but by using the --randseed option you should be able to get the same output for the same input.
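As a rough illustration of why a fixed seed makes a randomized pass repeatable, here is a minimal sketch using only Python's standard `random` module. It is not Morfessor's code: the function `visiting_order` and the toy word list are hypothetical, and the sketch only shows the general principle that the same seed yields the same shuffle.

```python
import random

def visiting_order(words, seed):
    """Return the words in a pseudo-random order determined by the seed.

    Hypothetical helper for illustration only; not part of Morfessor.
    """
    rng = random.Random(seed)  # private generator, global state is untouched
    order = list(words)        # copy so the caller's list is not modified
    rng.shuffle(order)
    return order

# Toy input, made up for this example.
words = ["walking", "walked", "talks", "talking", "houses"]

first = visiting_order(words, seed=42)
second = visiting_order(words, seed=42)

# Same seed and same input give the same visiting order on every run.
print(first == second)
```

Without a fixed seed, the order (and so possibly the resulting model) can differ from run to run, which is why passing the same --randseed value matters when comparing repeated trainings.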
The intrinsic measure for comparing different results is the cost (negative log-probability) that is logged during and after training. Lower is better. The correlation with performance in an actual application may be minimal, but at least sometimes better optimization leads to higher morphological segmentation accuracy.
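To make "negative log-probability, lower is better" concrete, here is a small sketch with made-up numbers. The function `neg_log_prob` and the probabilities are assumptions for illustration only; the cost that the tool actually logs is computed by its own model, not by this function.

```python
import math

def neg_log_prob(probabilities):
    """Cost of a sequence of independent events: minus the sum of log-probabilities."""
    return -sum(math.log(p) for p in probabilities)

# Made-up probabilities standing in for two training runs (not real results).
run_a = [0.5, 0.25, 0.25]
run_b = [0.5, 0.5, 0.25]

cost_a = neg_log_prob(run_a)
cost_b = neg_log_prob(run_b)

# The run that assigns higher probability to the data has the lower cost.
best = "run_b" if cost_b < cost_a else "run_a"
print(f"cost_a={cost_a:.4f} cost_b={cost_b:.4f} best={best}")
```

In practice you would compare the final cost values logged for runs with different seeds and keep the model with the lowest one, bearing in mind that a lower cost does not guarantee better results downstream.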
Hi, I'm developing a tokenizer for Korean. Since my project is to develop a language model using SRILM's ngram, the role of the tokenizer is very important. I couldn't run the experiment myself because the corpus is very large, but I'd like to hear your answer quickly, so I'm opening an issue. Is the result of Morfessor deterministic? In other words, will the same model be created if training is repeated dozens of times? If it is non-deterministic, are there any metrics or methods to measure how much the performance of the resulting tokenizers varies?