ahmetaa / zemberek-nlp

NLP tools for Turkish.

ngram model #191

Closed burakisikli closed 5 years ago

burakisikli commented 5 years ago

How did you create your n-gram model (lm.2gram.slm)? The language modeling section assumes there is already a language model in ARPA format to compress with SmoothLm. So how can we generate our own uncompressed model from our own data, or at least build on your model incrementally? Did you use KenLM (https://kheafield.com/code/kenlm/)?

PS: There are some misleading normalizations, such as "napıyorsun" -> "yapıyorsun" and "beklicem" -> "eklicem".

ahmetaa commented 5 years ago

I created a corpus from several text sources. For normalization, rather clean text sources are necessary. However, I cannot provide the corpus or the ARPA file for now. You can build your own. Some possible sources:

Then sentence boundary detection, lower-casing, and some other cleaning are applied to the raw text. Optionally you can remove duplicated sentences (you can use RemoveDuplicateLines from the apps module for this if you want). See the sketch below.
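Something like the following covers this step. It is a minimal sketch: the file names are hypothetical, and it replaces the RemoveDuplicateLines app with a simple in-memory de-duplication. Note that lower-casing should use the Turkish locale, otherwise ı/İ are handled incorrectly:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

import zemberek.tokenization.TurkishSentenceExtractor;

public class CleanCorpus {

  public static void main(String[] args) throws IOException {
    Path in = Paths.get("raw-corpus.txt");   // hypothetical input file
    Path out = Paths.get("sentences.txt");   // hypothetical output file

    TurkishSentenceExtractor extractor = TurkishSentenceExtractor.DEFAULT;
    Locale tr = new Locale("tr");            // Turkish locale lower-cases ı/İ correctly

    // LinkedHashSet keeps insertion order while dropping duplicate sentences.
    Set<String> sentences = new LinkedHashSet<>();
    for (String line : Files.readAllLines(in)) {
      for (String sentence : extractor.fromParagraph(line)) {
        sentences.add(sentence.toLowerCase(tr));
      }
    }
    Files.write(out, sentences);
  }
}
```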

Then a limited vocabulary is generated. You can use the GenerateVocabulary app from the apps module. I selected 300,000 words, but that is arbitrary. Some tools generate the vocabulary automatically from minimum count values.
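The sketch below shows the idea behind this step. It is not the GenerateVocabulary app itself, just a plain frequency count that keeps the N most frequent words (file names are hypothetical):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TopNVocabulary {

  public static void main(String[] args) throws IOException {
    int vocabularySize = 300_000; // the arbitrary limit mentioned above

    // Count token frequencies over the cleaned, one-sentence-per-line corpus.
    Map<String, Long> counts = new HashMap<>();
    for (String line : Files.readAllLines(Paths.get("sentences.txt"))) {
      for (String token : line.split("\\s+")) {
        if (!token.isEmpty()) {
          counts.merge(token, 1L, Long::sum);
        }
      }
    }

    // Keep the N most frequent words; everything else maps to <unk> later.
    List<String> vocabulary = counts.entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
        .limit(vocabularySize)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());

    Files.write(Paths.get("vocabulary.txt"), vocabulary);
  }
}
```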

After that, I created the ARPA files with KenLM's estimation tool lmplz (https://kheafield.com/code/kenlm/estimation/). You should apply pruning, either with lmplz during creation or later with SRILM entropy pruning, to reduce the model size.
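For illustration, invoking lmplz could look like the sketch below (here wrapped in a ProcessBuilder call; the order and pruning thresholds are only example values, and the file names are hypothetical):

```java
import java.io.IOException;

public class RunLmplz {

  public static void main(String[] args) throws IOException, InterruptedException {
    // Illustrative flags: -o sets the n-gram order, --prune drops singleton
    // 2-grams and 3-grams to keep the ARPA file small.
    ProcessBuilder pb = new ProcessBuilder(
        "lmplz",
        "-o", "3",
        "--prune", "0", "1", "1",
        "--text", "sentences.txt",
        "--arpa", "lm.arpa");
    pb.inheritIO();
    int exit = pb.start().waitFor();
    if (exit != 0) {
      throw new IOException("lmplz failed with exit code " + exit);
    }
  }
}
```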

Then SmoothLm conversion is applied with the CompressLm application in the apps module.
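Once the compressed model exists, it can be loaded and queried from the lm module. A minimal sketch, assuming the SmoothLm builder API and hypothetical file and word inputs:

```java
import java.io.File;
import java.io.IOException;

import zemberek.lm.compression.SmoothLm;

public class QuerySmoothLm {

  public static void main(String[] args) throws IOException {
    // Load the compressed model produced by CompressLm. File name is hypothetical.
    SmoothLm lm = SmoothLm.builder(new File("lm.slm")).build();

    // Convert words to vocabulary indexes and ask for the n-gram log probability.
    int[] ids = lm.getVocabulary().toIndexes("bu", "bir", "deneme");
    System.out.println("log probability = " + lm.getProbability(ids));
  }
}
```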

Thanks for pointing out the normalization mistakes. Many errors are expected in this release, though.

burakisikli commented 5 years ago

Thanks for your answer. Would you suggest using Zemberek's sentence boundary detection, any other tools you recommend, or should I do it manually? To create a useful, error-free model, what is the suggested data size (word or line count)? How would you evaluate the model?

mdakin commented 5 years ago

I think Zemberek does a pretty good job at sentence boundary detection and is quite fast; try it yourself to see if it works for you. There is an example in the examples module. Ahmet can probably answer the other questions.

ahmetaa commented 5 years ago

@burakisikli Preparing a corpus for language model generation depends on the task. Usually some form of normalization or cleaning and sentence boundary detection is necessary. As @mdakin said, Zemberek's TurkishSentenceExtractor is quite good at sentence boundary detection for mostly clean texts, but it assumes sentences are separated by certain punctuation. Read the tokenization documentation for more caveats. If that is not the case (let's say very few full stops are used), you may need other kinds of tools. There will always be errors, but as the corpus gets larger the effect of the mistakes diminishes.

N-gram model evaluation is usually done with a perplexity calculation (you can use SRILM or other tools for that); the lower the perplexity, the higher the quality. There are many non-n-gram (neural) models that produce very low perplexity, but that is a different subject and I do not have experience with them. I cannot give a definitive answer for vocabulary and sentence size; you need to experiment. Unfortunately, Turkish is a particularly sparse language for language models, so word models may not always give the best results. According to my past experiments, a 100,000-word vocabulary covers only 93-95% of the corpus. That is why for some tasks people use sub-word models or, very recently, neural character models (which I do not know much about).
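For illustration, a rough perplexity calculation over a compressed model might look like the sketch below. It assumes the SmoothLm API mentioned above, uses hypothetical file names, and ignores sentence-boundary symbols for simplicity; in practice tools like SRILM report this directly:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

import zemberek.lm.compression.SmoothLm;

public class Perplexity {

  public static void main(String[] args) throws IOException {
    // Model and test file names are hypothetical.
    SmoothLm lm = SmoothLm.builder(new File("lm.slm")).logBase(10).build();
    int order = lm.getOrder();

    double totalLogProb = 0;
    long tokenCount = 0;

    for (String line : Files.readAllLines(Paths.get("test-sentences.txt"))) {
      int[] ids = lm.getVocabulary().toIndexes(line.trim().split("\\s+"));
      // Score each word given up to (order - 1) predecessors.
      for (int i = 1; i < ids.length; i++) {
        int start = Math.max(0, i - order + 1);
        int[] ngram = Arrays.copyOfRange(ids, start, i + 1);
        totalLogProb += lm.getProbability(ngram);
        tokenCount++;
      }
    }

    // Perplexity = 10^(-average log10 probability). Lower is better.
    double perplexity = Math.pow(10, -totalLogProb / tokenCount);
    System.out.println("perplexity = " + perplexity);
  }
}
```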

ahmetaa commented 5 years ago

I will note the normalization mistakes, but I am closing this issue.

ahmetaa commented 5 years ago

Also, if you need to improve normalization accuracy, you need more than changing the LM. There are more components involved in that:

More components may be involved later, but I am not sure.