PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the core framework of PaddlePaddle ("飞桨"): high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

Build n-gram language model for DeepSpeech2, and add inference interfaces pluggable into the CTC decoder. #2229

Closed by xinghai-sun 6 years ago

xinghai-sun commented 7 years ago
pkuyym commented 7 years ago

@cxwangyi @kuke @xinghai-sun Hi, as mentioned in the paper, a language model has to be trained to improve the decoding results, and the LM is a critical component for the final performance. The language model was trained on text crawled from commoncrawl.org using the KenLM toolkit, but we need more details to train such a model. Is it possible to get the trained language model, or the text dataset it was trained on?
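
For reference, a minimal sketch of the KenLM route mentioned above, assuming the `kenlm` Python bindings and KenLM's `lmplz`/`build_binary` command-line tools are installed; the corpus path, model paths, and n-gram order are placeholders, not the paper's actual settings:

```python
# Rough sketch of the KenLM workflow (paths and order are illustrative):
#
#   # train a 5-gram LM with KenLM's default modified Kneser-Ney smoothing
#   lmplz -o 5 < corpus.txt > lm.arpa
#   # convert to a binary format for fast loading at decode time
#   build_binary lm.arpa lm.binary

import kenlm

model = kenlm.Model('lm.binary')  # accepts either .arpa or KenLM binary files
print(model.order)                # n-gram order, e.g. 5
# log10 probability of a whitespace-tokenized sentence, with <s>/</s> added
print(model.score('the quick brown fox', bos=True, eos=True))
```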

xinghai-sun commented 7 years ago
  1. There are plenty of English corpora available, so we don't have to stick to the one mentioned in the paper; we can start with a small corpus such as PTB.
  2. For training the n-gram LM, prefer KenLM; if another tool is used, make sure the smoothing method is at least aligned with the paper's or otherwise reasonable.
  3. Focus on the interface design for model loading and inference, and make sure it integrates cleanly with the beam search decoder (see the sketch after this list).
  4. Contact the NLP team or SVAIL to see whether a powerful, ready-made LM model already exists, for both Chinese and English; please have @lcy-seso assist.
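
As a starting point for item 3, here is a minimal sketch of what an external-scorer interface pluggable into the CTC beam search decoder might look like. The class name `Scorer` and the `alpha`/`beta` weights are assumptions for illustration, not the interface agreed on in this issue:

```python
import math

import kenlm


class Scorer(object):
    """External scorer combining an n-gram LM score with a word-count bonus.

    score(sentence) = alpha * log10 P_lm(sentence) + beta * log(word_count)
    """

    def __init__(self, alpha, beta, model_path):
        self._alpha = alpha                 # weight on the LM log-probability
        self._beta = beta                   # weight on the word-insertion bonus
        self._lm = kenlm.Model(model_path)  # accepts .arpa or KenLM binary files

    def __call__(self, sentence):
        # log10 probability of the candidate transcript, with <s>/</s> added
        lm_score = self._alpha * self._lm.score(sentence, bos=True, eos=True)
        # word-insertion bonus so the LM does not over-penalize longer hypotheses
        word_bonus = self._beta * math.log(max(len(sentence.split()), 1))
        return lm_score + word_bonus
```

Keeping the scorer a single callable lets the decoder stay agnostic to the LM implementation: the beam search would call it on a candidate prefix whenever a complete word is emitted and add the result to that prefix's log probability.
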
wwfcnu commented 5 months ago

@lcy-seso Is there a ready-to-use LM model available?