PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the core framework of PaddlePaddle ("飞桨"): high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

Build n-gram language model for DeepSpeech2, and add inference interfaces pluggable into the CTC decoder. #2229

Closed by xinghai-sun 6 years ago

xinghai-sun commented 7 years ago
pkuyym commented 7 years ago

@cxwangyi @kuke @xinghai-sun Hi, as mentioned in the paper, a language model has to be trained to improve the decoding results, and the LM is a critical component for the final performance. The language model was trained on text crawled from commoncrawl.org using the KenLM toolkit, but we need more details to train such a model. Is it possible to get the trained language model, or the text dataset it was trained on?
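
For reference, a minimal sketch of the KenLM route mentioned above, assuming the `kenlm` Python bindings and KenLM's `lmplz`/`build_binary` command-line tools are installed; the corpus path, model paths, and n-gram order are placeholders, not the paper's actual settings:

```python
# Rough sketch of the KenLM workflow (paths and order are illustrative):
#
#   # train a 5-gram LM with KenLM's default modified Kneser-Ney smoothing
#   lmplz -o 5 < corpus.txt > lm.arpa
#   # convert to a binary format for fast loading at decode time
#   build_binary lm.arpa lm.binary

import kenlm

model = kenlm.Model('lm.binary')  # accepts either .arpa or KenLM binary files
print(model.order)                # n-gram order, e.g. 5
# log10 probability of a whitespace-tokenized sentence, with <s>/</s> added
print(model.score('the quick brown fox', bos=True, eos=True))
```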

xinghai-sun commented 7 years ago
  1. There are plenty of English corpora available, so we don't have to stick to the one mentioned in the paper; we can start with a small corpus such as PTB.
  2. For training the n-gram LM, prefer KenLM; if another tool is used, make sure the smoothing method is at least aligned with the paper's or otherwise reasonable.
  3. Focus on the interface design for model loading and inference, and make sure it integrates cleanly with the beam search decoder (see the sketch after this list).
  4. Contact the NLP team or SVAIL to see whether a powerful, ready-made LM model already exists, for both Chinese and English; please have @lcy-seso assist.
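
As a starting point for item 3, here is a minimal sketch of what an external-scorer interface pluggable into the CTC beam search decoder might look like. The class name `Scorer` and the `alpha`/`beta` weights are assumptions for illustration, not the interface agreed on in this issue:

```python
import math

import kenlm


class Scorer(object):
    """External scorer combining an n-gram LM score with a word-count bonus.

    score(sentence) = alpha * log10 P_lm(sentence) + beta * log(word_count)
    """

    def __init__(self, alpha, beta, model_path):
        self._alpha = alpha                 # weight on the LM log-probability
        self._beta = beta                   # weight on the word-insertion bonus
        self._lm = kenlm.Model(model_path)  # accepts .arpa or KenLM binary files

    def __call__(self, sentence):
        # log10 probability of the candidate transcript, with <s>/</s> added
        lm_score = self._alpha * self._lm.score(sentence, bos=True, eos=True)
        # word-insertion bonus so the LM does not over-penalize longer hypotheses
        word_bonus = self._beta * math.log(max(len(sentence.split()), 1))
        return lm_score + word_bonus
```

Keeping the scorer a single callable lets the decoder stay agnostic to the LM implementation: the beam search would call it on a candidate prefix whenever a complete word is emitted and add the result to that prefix's log probability.
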
wwfcnu commented 5 months ago

@lcy-seso Is there a ready-to-use LM model available?