About fastText's n-grams - Githubissues

649453932 / Chinese-Text-Classification-Pytorch

Chinese text classification with TextCNN, TextRNN, FastText, TextRCNN, BiLSTM_Attention, DPCNN, and Transformer, implemented in PyTorch and ready to use out of the box.

MIT License

5.29k stars, 1.23k forks

The logic is as follows:

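1. If a dictionary is first built from all bi-grams and tri-grams, the n-gram vocabulary becomes very large, and so does the computational cost.
2. Because of 1., the author uses hashing to bound the vocabulary size.
3. Whether the vocabulary size is bounded by the hashing in 2. or by some other method, the vast majority of n-grams end up as unseen n-gram features (in practice, n-gram features are sparser for Chinese text classification than for English). Given that, the hashing scheme here amounts to applying Katz backoff to the bi-grams and tri-grams. Katz backoff: back off from an N-gram to the (N-1)-gram, e.g. Count(the, dog) ~= Count(dog).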
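To make points 1. and 2. concrete, here is a minimal sketch of the hashing trick, not the repository's actual code. The function names, the multipliers, and the use of `0` for a missing previous token are all assumptions made for illustration.

```python
# Arbitrary multipliers for illustration; not taken from any particular library.
MULT_A = 1_000_003
MULT_B = 999_983


def bigram_bucket(ids, t, num_buckets):
    """Map the bigram (ids[t-1], ids[t]) to a bucket in [0, num_buckets)."""
    prev = ids[t - 1] if t >= 1 else 0  # 0 stands in for "no previous token"
    return (prev * MULT_A + ids[t]) % num_buckets


def trigram_bucket(ids, t, num_buckets):
    """Map the trigram (ids[t-2], ids[t-1], ids[t]) to a bucket."""
    prev1 = ids[t - 1] if t >= 1 else 0
    prev2 = ids[t - 2] if t >= 2 else 0
    return (prev2 * MULT_A * MULT_B + prev1 * MULT_A + ids[t]) % num_buckets


def ngram_features(ids, num_buckets):
    """Return per-position bigram and trigram bucket ids for one sequence."""
    bigrams = [bigram_bucket(ids, t, num_buckets) for t in range(len(ids))]
    trigrams = [trigram_bucket(ids, t, num_buckets) for t in range(len(ids))]
    return bigrams, trigrams


def count_distinct_bigrams(sequences):
    """Size an explicit bigram dictionary would need (grows with the corpus)."""
    seen = set()
    for ids in sequences:
        seen.update(zip(ids, ids[1:]))
    return len(seen)
```

With this scheme the n-gram embedding table has `num_buckets` rows no matter how many distinct n-grams the corpus contains, whereas the explicit dictionary measured by `count_distinct_bigrams` keeps growing. The price is that n-grams colliding in the same bucket share one embedding row.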

Reference: [Mikolov et al. 2011] Tomáš Mikolov, Anoop Deoras, Daniel Povey, Lukáš Burget, and Jan Černocký. 2011. Strategies for training large scale neural network language models. In Workshop on Automatic Speech Recognition and Understanding. IEEE. https://en.wikipedia.org/wiki/Katz%27s_back-off_model


About fastText's n-grams #94