deeppavlov / DeepPavlov

An open source library for deep learning end-to-end dialog systems and chatbots.
https://deeppavlov.ai
Apache License 2.0
6.73k stars 1.15k forks

Source corpus for the Russian KenLM language model #995

Closed vadimkantorov closed 4 years ago

vadimkantorov commented 5 years ago

What was the source data for the http://files.deeppavlov.ai/lang_models/ru_wiyalen_no_punkt.arpa.binary.gz language model? What KenLM arguments were used for the estimation?

Thank you!

yoptar commented 5 years ago

Hi @vadimkantorov, the 3-gram model was trained on Russian Wikipedia and a closed news dataset. All punctuation characters were removed beforehand.
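
For anyone trying to build a comparable model, here is a rough sketch of the punctuation-removal step. This is not the DeepPavlov team's actual script: the function names, the file paths, and the choice to treat every Unicode "P*" category character as punctuation are my assumptions, and the thread does not say whether the text was also lowercased or otherwise normalized.

```python
import unicodedata


def strip_punctuation(line: str) -> str:
    """Replace every Unicode punctuation character with a space,
    then collapse runs of whitespace into single spaces.

    Assumption: "punctuation" means any character whose Unicode
    category starts with "P" (Po, Pd, Pi, Pf, ...).
    """
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in line
    )
    return " ".join(cleaned.split())


def clean_corpus(src_path: str, dst_path: str) -> None:
    """Write a punctuation-free copy of a one-sentence-per-line corpus.

    Both paths are hypothetical placeholders supplied by the caller.
    """
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            stripped = strip_punctuation(line)
            if stripped:  # skip lines that contained only punctuation
                dst.write(stripped + "\n")


# A 3-gram model could then be estimated with the standard KenLM tools,
# for example (file names are placeholders, other flags are unknown):
#   lmplz -o 3 < corpus.clean.txt > model.arpa
#   build_binary model.arpa model.arpa.binary
```

Only the order (`-o 3`) is confirmed above; any pruning, memory, or smoothing options used for the published model remain unknown.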