wanghaisheng commented 8 years ago

Stanford CS224d course: http://blog.csdn.net/han_xiaoyang/article/details/51567822

wanghaisheng commented 8 years ago

http://torch.ch/blog/2016/07/25/nce.html (language models)

wanghaisheng commented 8 years ago

https://mp.weixin.qq.com/s?__biz=MzA4OTk5OTQzMg==&mid=2449231335&idx=1&sn=d3ba98841e85b7cea0049cc43b3c16ca An attempt at Chinese word segmentation with deep learning

wanghaisheng commented 8 years ago

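Dataset: HFL-RC: A Chinese Reading Comprehension Dataset http://hfl.iflytek.com/chinese-rc/ http://arxiv.org/abs/1607.02250 [The HIT-iFLYTEK joint laboratory releases Chinese reading comprehension datasets]

On July 18, 2016, the HIT-iFLYTEK joint laboratory (HFL) released cloze-style Chinese reading comprehension datasets, comprising a People's Daily news dataset and a "children's reading" dataset (HFL-RC: People Daily and CFT dataset).

For English reading comprehension there are already the Google DeepMind CNN/Daily Mail datasets and the Facebook CBTest dataset, but a Chinese reading comprehension dataset has been missing. The datasets released by HFL fill that gap. Unlike the two English datasets, the "children's reading" dataset also contains human-written questions, which are harder to answer than automatically constructed ones and pose a new challenge for reading comprehension research.

In addition, we designed a simple and effective neural network for cloze-style reading comprehension and obtained good results. HFL is focusing on reading comprehension, an AI problem that draws wide attention in the field, and will release further results over time.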

wanghaisheng commented 8 years ago

topwords experiment. Logged in as user wanghs; the current directory is /home/wanghs/projects/. Download the source:

git clone https://github.com/qf6101/topwords

Upload the test data to HDFS. First create the upload directory:

[wanghs@node3 test_data]$ sudo -u hdfs hadoop dfs -mkdir -p /data/topwords/test_data

If the permissions are not changed first, the copy fails as follows:

[wanghs@node3 test_data]$ sudo -u hdfs hadoop fs -copyFromLocal /home/wanghs/projects/topwords/test_data/story_of_stone.txt /data/topwords/test_data
copyFromLocal: `/home/wanghs/projects/topwords/test_data/story_of_stone.txt': No such file or directory

Change the ownership of the directory as follows:

[wanghs@node3 test_data]$ sudo -u hdfs hadoop fs -chown -R wanghs:wanghs /data

Then copy the local file to HDFS:

[wanghs@node3 test_data]$ hadoop fs -copyFromLocal story_of_stone.txt /data/topwords/test_data

[wanghs@node3 test_data]$ hadoop fs -ls /data/topwords/test_data
Found 1 items
-rw-r--r-- 3 wanghs wanghs 2633092 2016-08-16 18:56 /data/topwords/test_data/story_of_stone.txt

Create the output directory in the same way:

[wanghs@node3 test_data]$ sudo -u hdfs hadoop fs -mkdir -p /data/topwords/output
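The manual upload steps above can be collected into one small helper. This is only a sketch: `DRY_RUN`, `run` and `stage_corpus` are hypothetical names, not part of topwords, and the helper changes ownership of the target directory only rather than all of /data. By default it just prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: stage a local corpus file into HDFS, mirroring the manual steps above.
# DRY_RUN, run and stage_corpus are illustrative names, not part of topwords.

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

stage_corpus() {
  local local_file="$1" hdfs_dir="$2" owner="$3"
  # Create the target directory as the hdfs superuser.
  run sudo -u hdfs hadoop fs -mkdir -p "$hdfs_dir"
  # Hand the directory over to the submitting user (assumption: narrower than chown on /data).
  run sudo -u hdfs hadoop fs -chown -R "$owner:$owner" "$hdfs_dir"
  # Copy as the normal user, who can read the local file.
  run hadoop fs -copyFromLocal "$local_file" "$hdfs_dir"
  # List the directory to confirm the upload.
  run hadoop fs -ls "$hdfs_dir"
}

stage_corpus story_of_stone.txt /data/topwords/test_data wanghs
```

Running it with `DRY_RUN=0` would execute the same four commands against the cluster.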

Edit the configuration file:

#!/usr/bin/env bash

# get into the current directory
cd "$( cd "$( dirname "$0"  )" && pwd  )"

##### The Parameters You Need to Predefine Start #####

# set the environment variables
HADOOP_HOME="/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop"
SPARK_HOME="/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark"
topwords_jar="../release/topwords-1.0.jar"  #topwords jar file

# set the arguments
inputLoc="/data/topwords/test_data/*"  #location of input corpus in HDFS
outputLoc="/data/topwords/output" #location of output dictionary and segmented corpus in HDFS
tauL="6"  #threshold of word length
tauF="100"  #threshold of word frequency
numIterations="5"  #number of iterations
convergeTol="1E-3"  #convergence tolerance
textLenThld="2000"  #preprocessing threshold of text length
useProbThld="1E-8"  #prune threshold of word use probability
wordBoundaryThld="0.0"  #segment threshold of word boundary score (use segment tree if set to <= 0)
numPartitions="5000"  #number of partitions
executor_memory="5G"  #memory allocation for each executor
num_executors="40"  #number of executors allocated
executor_cores="1"  #number of cores allocated for each executor
queue="queue_name"  #yarn queue

##### The Parameters You Need to Predefine End #####

# execute the TopWORDS algorithm

function_exec(){
${SPARK_HOME}/bin/spark-submit \
--class io.github.qf6101.topwords.TopWORDSApp \
--master yarn \
--deploy-mode cluster \
--name topwords \
--executor-memory $executor_memory \
--num-executors $num_executors \
--executor-cores $executor_cores \
--queue $queue \
${topwords_jar} \
--inputLoc $inputLoc \
--outputLoc $outputLoc \
--tauL $tauL \
--tauF $tauF \
--numIterations $numIterations \
--convergeTol $convergeTol \
--textLenThld $textLenThld \
--useProbThld $useProbThld \
--wordBoundaryThld $wordBoundaryThld \
--numPartitions $numPartitions
}

output="topwords.running"
function_exec > ${output} 2>&1

sleep 60s
app_id=`grep -Eo "application_[0-9]+_[0-9]+" ${output} | head -n1`
logfile=${output}.log
${HADOOP_HOME}/bin/yarn logs -applicationId ${app_id} > ${logfile}
#check execution result
if grep -i ERROR ${logfile}
then
        rm $output
        echo "Exception occurred in $0. See `readlink -f ${logfile}`"
        exit 1
else
        rm $output
        echo "Finish $0."
        exit 0
fi

Edit /etc/profile:


export HADOOP_HOME=/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop
export HIVE_HOME=/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hive
export HBASE_HOME=/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hbase
export HADOOP_HDFS_HOME=/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop-hdfs
export HADOOP_MAPRED_HOME=/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop-mapreduce
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_LIBEXEC_DIR=${HADOOP_HOME}/libexec
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HDFS_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_YARN_HOME=/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop-yarn
export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop

The run fails with the error below. I could not resolve it, so this is shelved for now.


[wanghs@node3 topwords]$ bash deploy/sbin/topwords_local.sh
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/SparkSession$
        at io.github.qf6101.topwords.TopWORDSApp$.main(TopWORDSApp.scala:15)
        at io.github.qf6101.topwords.TopWORDSApp.main(TopWORDSApp.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.SparkSession$
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        ... 11 more
Running TopWORDS fail at Wed Aug 17 13:32:53 CST 2016

Trying again with an image hosted in China: https://hub.tenxcloud.com/repos/sdvdxl/spark-native-yarn
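A likely cause of the error above, offered as a diagnosis rather than something confirmed in this thread: `org.apache.spark.sql.SparkSession` was introduced in Spark 2.0, while CDH 5.7.0 bundles Spark 1.6, so a jar built against Spark 2.x cannot find the class. The sketch below checks a version string against that requirement; `spark_major` and `has_spark_session` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a Spark version string is new enough for SparkSession.
# spark_major and has_spark_session are illustrative helper names.

spark_major() {
  # Print the major component of a version such as "1.6.0" or "2.0.0".
  printf '%s\n' "${1%%.*}"
}

has_spark_session() {
  # SparkSession first appeared in Spark 2.0, so require major >= 2.
  [ "$(spark_major "$1")" -ge 2 ]
}

for v in 1.6.0 2.0.0; do
  if has_spark_session "$v"; then
    echo "Spark $v: SparkSession available"
  else
    echo "Spark $v: SparkSession missing, expect NoClassDefFoundError"
  fi
done
```

The installed version can be read with `${SPARK_HOME}/bin/spark-submit --version` and passed to `has_spark_session`.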

wanghaisheng commented 8 years ago

http://nbviewer.jupyter.org/github/jayantj/gensim/blob/683720515165a332baed8a2a46b6711cefd2d739/docs/notebooks/Word2Vec_FastText_Comparison.ipynb
Run an experiment with fastText on the parsed XPS reports.

wanghaisheng commented 8 years ago

What industries are next to be disrupted by NLP and Text Analysis? http://blog.aylien.com/nlp-text-analysis-insurance-legal-customer-service/

wanghaisheng commented 8 years ago

POS tagging: https://arxiv.org/pdf/1609.07053v1.pdf To address the first question, we will look at convolutional neural networks (CNNs) and recurrent neural networks (RNNs), which are both highly prominent approaches in recent natural language processing (NLP) literature. A recent development is the emergence of deep residual networks (ResNets), a building block for CNNs. ResNets consist of several stacked residual units, which can be thought of as a collection of convolutional layers coupled with a ‘shortcut’ which aids the propagation of the signal in a neural network. This allows for the construction of much deeper networks, since keeping a ‘clean’ information path in the network facilitates optimisation (He et al., 2016). ResNets have recently shown state-of-the-art performance for image classification tasks (He et al., 2015; He et al., 2016), and have

wanghaisheng commented 8 years ago

Welcome to the awesome-nlp wiki!

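Training a Chinese language model
- n-gram
- RNN https://www.tensorflow.org/versions/master/tutorials/recurrent/index.html#recurrent-neural-networks https://github.com/dmlc/mxnet/blob/f786095665790ba3b877e8e3f7916361e6e2e3b1/docs/system/rnn_interface.md
- Corpus sources: Chinese Wikipedia + PKU People's Daily + Douban movie reviews, a bit over 600 million characters in total
  http://www.sogou.com/labs/resource/list_yuliao.php
  http://sighan.cs.uchicago.edu/bakeoff2005/
  https://github.com/IntelLabs/rnnlm/tree/master/data
  https://github.com/yandex/faster-rnnlm
  (1) Chinese and English news corpora from the Institute of Automation, Chinese Academy of Sciences: http://www.datatang.com/data/13484 The Chinese news classification corpus was collected from sections of Phoenix (ifeng), Sina, NetEase, Tencent, and others. The English news classification corpus is the ModApte version of Reuters-21578.
  (2) Sogou's Chinese news corpus: http://www.sogou.com/labs/dl/c.html A large amount of Sohu news text with the corresponding category labels. Several sizes are available for download.
  (3) 李荣陆's Chinese corpus: http://www.datatang.com/data/11968 About 240 MB compressed.
  (4) 谭松波's Chinese text classification corpus: http://www.datatang.com/data/11970 It contains not only top-level categories such as economy and sports, but also specific subcategories under each one (sports includes basketball, football, and so on). It can serve as a corpus for hierarchical classification and is very practical. This address requires no download credits (谭松波's home page): http://www.searchforum.org.cn/tansongbo/corpus1.php
  (5) NetEase categorized text data: http://www.datatang.com/data/11965 4,000 texts in six categories, including sports and automobiles.
  (6) Chinese text classification corpus: http://www.datatang.com/data/11963 Corpus texts in categories such as Arts and Literature.
  (7) A fuller Sogou text classification corpus: http://www.sogou.com/labs/dl/c.html Released by Sogou Labs; several sizes are available for free download.
  (8) 2002 Chinese web page classification training set: http://www.datatang.com/data/15021 In autumn 2002, the Tianwang group of Peking University's network and distributed systems laboratory mobilized dozens of students from different majors to hand-pick a new large-scale, hierarchy-based sample set of Chinese web pages. It contains 11,678 training pages and 3,630 test pages, spread across 11 top-level categories.
  http://www.blogjava.net/wangxinsh55/archive/2016/01/13/429028.aspx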

Weibo vocabulary: https://github.com/StevenLOL/chinese-weibo-noun/tree/8c802b4f6139cecad96e907a0632244cbc17dab8

Word2Vec experiments on Chinese and English Wikipedia corpora: http://www.52nlp.cn/%e4%b8%ad%e8%8b%b1%e6%96%87%e7%bb%b4%e5%9f%ba%e7%99%be%e7%a7%91%e8%af%ad%e6%96%99%e4%b8%8a%e7%9a%84word2vec%e5%ae%9e%e9%aa%8c

http://licstar.net/archives/262 Obtaining a Simplified Chinese Wikipedia corpus https://github.com/dsindex/syntaxnet
https://github.com/tensorflow/models/tree/master/syntaxnet

Comparison of FastText and Word2Vec

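TopWORDS is a method recently published in PNAS. Without any prior knowledge, it quickly learns a ranked dictionary, together with the word segmentation of the text, from a large-scale Chinese corpus. http://qf6101.github.io/machine%20learning/2016/07/01/TopWORDS

https://mp.weixin.qq.com/s?__biz=MzAxMzA2MDYxMw==&mid=2651555619&idx=1&sn=4cdc0e19cf259845825f6a95707e1105 [In depth] 邓柯: Unsupervised Chinese text analysis based on statistical models

In a recent deep learning paper from Google, Oriol Vinyals, Geoff Hinton, and colleagues applied LSTMs to the parsing problem in NLP and obtained good results. http://arxiv.org/pdf/1412.7449v1.pdf

Slides of our NAACL-16 tutorial, Recent Progress on DL 4 NLP (web link), and slides of my talk at the QA workshop, Towards Neural-Net-based QA (web link). "Deep Learning For Natural Language Processing", given last Sunday at City University of Hong Kong, Seminar on DL4NLP: http://lt.cityu.edu.hk/Research/cel/ http://nlp.fudan.edu.cn/xpqiu/slides/20160618_DL4NLP@CityU.pdf

The Symposium on Frontier Technologies in Natural Language Processing and the Tsinghua University "Computing the Future" graduate forum was held successfully:

http://www.cipsc.org.cn/qngw/?p=800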

Language Understanding for Text-based Games using Deep Reinforcement Learning #PaperWeekly# http://rsarxiv.github.io/2016/06/27/Language-Understanding-for-Text-based-Games-using-Deep-Reinforcement-Learning-PaperWeekly/

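[The HIT-iFLYTEK joint laboratory makes progress on zero anaphora resolution]

In zero pronoun resolution for natural language understanding, manually annotated training data is very limited, so deep learning cannot show its strength. Researchers at the HIT-iFLYTEK joint laboratory proposed a method for automatically constructing large-scale "pseudo training data": if a noun occurs twice in a text, the later occurrence is turned into an empty slot (a zero pronoun). This yields a zero pronoun resolution instance whose antecedent (the answer that should fill the slot) is the noun itself. In this way an unlimited amount of pseudo training data can be constructed. Its characteristics do not fully match those of real data, but the volume is huge, so it can be used for pre-training and then combined with the very limited real training data. Within a unified deep learning framework, this quickly improved on the best existing Chinese zero anaphora resolution result by 5 percentage points. The method is simple and clean, ports easily across domains, and still has considerable room for improvement.

The paper is on arXiv (web link). Authors: 刘挺, 崔一鸣, 尹庆宇, 王士进, 张伟男, 胡国平. Comments and corrections from colleagues are welcome.

A zero anaphora example: "小明去找他妈妈了，【】一直没回来" ("Xiao Ming went to look for his mother, and [] still has not come back"). Who has not come back? It should be Xiao Ming, not the mother.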
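The pseudo training data construction described above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it assumes each input line is one text, already tokenized and tagged as `word/TAG`, and it treats any tag starting with `n` as a noun. The function name `make_pseudo_sample` and the `<blank>` marker are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: build one pseudo "zero pronoun" sample per input line.
# Input: whitespace-separated tokens written as word/TAG; tags starting with "n"
# are treated as nouns (an assumption about the tag set, not from the paper).
# Output: the text with the repeated noun blanked, a tab, then the answer.

make_pseudo_sample() {
  awk '{
    split("", seen)                 # reset the set of nouns seen on this line
    found = 0; out = ""; answer = ""
    for (i = 1; i <= NF; i++) {
      n = split($i, parts, "/")
      word = parts[1]; tag = (n > 1) ? parts[2] : ""
      if (!found && tag ~ /^n/ && (word in seen)) {
        # Second occurrence of a noun: turn it into the empty slot.
        answer = word; word = "<blank>"; found = 1
      } else if (tag ~ /^n/) {
        seen[word] = 1
      }
      out = out (i > 1 ? " " : "") word
    }
    # Lines without a repeated noun produce no sample.
    if (found) print out "\t" answer
  }'
}

printf '%s\n' "Alice/nr went/v home/n because/c Alice/nr was/v tired/a" | make_pseudo_sample
```

Only the first repeated noun on each line is blanked, so each text yields at most one sample.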

What are the current research hot spots and difficulties in NLP, in China and abroad? https://www.zhihu.com/question/30305058

facebook fasttext https://github.com/facebookresearch/fastText https://github.com/kemaswill/fasttext_torch

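What I find most interesting about this work is that it can find the most informative sentences for an entity, which are often the entity's definition or description. When building a knowledge graph, we can then automatically construct a textual description for each newly added entity.

cs.CL daily# Knowledge representation currently faces two challenges: (1) how to make better use of an entity's context; (2) how to find the sentences relevant to an entity. To address both, this paper proposes a model that learns representations from multiple sentences. Knowledge Representation via Joint Learning of Sequential Text and Knowledge Graphs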

http://arxiv.org/abs/1609.07075

wanghaisheng commented 8 years ago

blogs http://colah.github.io/
http://t.cn/Rq8UhU6

[In depth] The next field deep learning is about to conquer: NLP. A reading of outstanding ACL 2016 papers (part 1) http://mp.weixin.qq.com/s?__biz=MzIzOTU0NTQ0MA==&mid=2247483864&idx=1&sn=75136bfb9afc4e4f3ed1d3697151aef3 [Deep Learning in NLP: word vectors and language models](http://licstar.net/archives/328)

A reading guide to "How to Generate a Good Word Embedding?"
http://www.hankcs.com/nlp/

[立委's primer: a brief account of natural language system architecture] http://blog.sciencenet.cn/home.php?mod=space&uid=362400&do=blog&quickforward=1&id=981742

"Deep Learning and Natural Language Processing (Stanford cs224d)" by 寒小阳 and 龙心尘. Lecture 1: http://blog.csdn.net/longxinchen_ml/article/details/51567960 Deep Learning and Natural Language Processing (7), Stanford cs224d: language models, RNN, LSTM and GRU http://blog.csdn.net/longxinchen_ml/article/details/51940065

(Slides) "Using Text Embeddings for Information Retrieval" by Bhaskar Mitra

Natural language processing technology in the wave of deep learning

http://t.cn/R5GX0Jc http://pan.baidu.com/s/1kVlN3YB

wanghaisheng commented 8 years ago

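PhD theses: a fellow student recommended a list of PhD theses three years ago
http://www.weibo.com/1657470871/zqgjp9sMK?type=comment#_rnd1462289209093

In fact, reading through the recent PhD theses from just four groups (Stanford NLP, Berkeley NLP, CMU LTI, JHU CLSP) gives you more than half of the picture of the field.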

Berkeley

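Liang, P. 2011. Learning Dependency-Based Compositional Semantics

Liang is an assistant professor in the Stanford computer science department and the third professor in the Stanford NLP group; he did his master's at MIT under Mike Collins. This thesis extends his ACL 2011 talk on computational semantics. I remember that when he demonstrated the system during his ACL talk the room was packed, with many people standing, and the applause was thunderous. The thesis is short and not difficult, but very interesting.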

Cornell

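Danescu-Niculescu-Mizil, D. 2012. A Computational Approach to Linguistic Coordination

This is a rather unconventional NLP thesis, by a student of Lillian Lee at Cornell. It covers the linguistic phenomenon of entrainment. It is very short and easy to follow, and worth a quick look.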

Cambridge

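Wallach, H. M. 2008. Structured Topic Models for Language

Wallach is an assistant professor of computer science at UMass and was a student of Professor MacKay, the information theory luminary at Cambridge. The thesis offers fairly deep analysis and understanding of topic models, especially nonparametric topic models, and also covers the application of nonparametric topic models to syntactic parsing. Wallach also wrote a well-known CRF tutorial.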

MIT

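Eisenstein, J. 2008. Gesture in Automatic Discourse Processing

Eisenstein is currently an assistant professor in the College of Computing at Georgia Tech and did his PhD at MIT under Barzilay. The thesis is quite distinctive: it studies gesture, which sets it apart from ordinary NLP theses.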

Edinburgh

Callison-Burch, C. 2007. Paraphrasing and Translation

CCB is a fairly well-known researcher in MT and will join UPenn as an assistant professor in 2013. The thesis starts from paraphrasing and studies its applications in MT.

Brown

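Goldwater, S. 2007. Nonparametric Bayesian Model of Lexical Acquisition

Goldwater is one of the most accomplished researchers of recent years in nonparametric Bayesian models and their application to NLP. Lately she has been studying not only lexical acquisition but also phonological phenomena, using very interesting nonparametric Bayesian models.

Elsner, M. 2011. Generalizing Local Coherence Modeling

Elsner is now an assistant professor in the linguistics department at OSU. The thesis mainly studies discourse-level dialogue modeling.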

U.Penn

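Talukdar, P. P. 2010. Graph-Based Weakly Supervised Methods for Information Extraction and Integration

Talukdar is also one of the last few students taken on by Fernando Pereira, research director at Google, while he was teaching at Penn. He is now a postdoc with Mitchell in CMU's Machine Learning Department. The thesis mainly presents applications of graph-based methods to information extraction. Worth a look if you are interested in information extraction.

Blitzer, J. 2008. Domain Adaptation of Natural Language Processing Systems

Blitzer is one of the last few students taken on by Fernando Pereira, research director at Google, while he was teaching at Penn. The thesis mainly presents a domain adaptation method called structural correspondence learning, which is neither hard to understand nor hard to implement.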

U.Maryland
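Lopez, A. 2008. Machine Translation by Pattern Matching

A machine translation thesis, written fairly thoroughly. It presents experiments with pattern matching for tera-scale MT.

[PhD thesis: Research on Neural-Network-Based Semantic Vector Representations of Words and Documents](http://licstar.net/archives/687) http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/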

wanghaisheng commented 8 years ago

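Libraries: word segmentation
Evaluation: http://www.scholat.com/vpost.html?pid=4477
Benchmark of common word segmentation components: https://github.com/ysc/cws_evaluation
A brief discussion of Chinese word segmentation
https://github.com/fxsjy/jieba/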
https://github.com/NLPchina/ansj_seg
https://github.com/huaban/jieba-analysis
https://github.com/hankcs/HanLP

https://github.com/FudanNLP/fnlp
http://ictclas.nlpir.org/nlpir/

https://github.com/nltk/nltk NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets and tutorials supporting research and development in Natural Language Processing.

https://github.com/thunlp
https://github.com/NLPchina

https://github.com/WILAB-HIT/News/tree/master/2015/10/30
http://wi.hit.edu.cn/cemr/
https://github.com/woshialex/diagnose-heart
'RNNLM Toolkit - Recurrent Neural Network Language Modeling (RNNLM) Toolkit' by Intel Labs (GitHub)

https://github.com/NLPchina/ansj_seg (POS tagging)
https://www.zhihu.com/question/19929473

https://github.com/thunlp

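Currently includes: the knowledge representation learning toolkit KG2E, the keyword extraction and tag recommendation toolkit THUTag, the Chinese lexical analysis toolkit THULAC, the Chinese text classification toolkit THUCTC, and others. You are welcome to try them and send feedback.

https://github.com/lionsoul2014/jcseg Jcseg is a lightweight open-source Chinese word segmenter based on the mmseg algorithm. It also integrates keyword extraction, key phrase extraction, key sentence extraction, and automatic summarization, and provides segmentation interfaces for the latest versions of lucene, solr, and elasticsearch. http://git.oschina.net/lionsoul/jcseg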

What are the commonly used open-source NLP projects and toolkits at present?

Please cover text and speech separately.

### 19 answers

[刘知远](/people/liuliudong), NLPer

Our lab recently put together and released a batch of open-source NLP toolkits. They are listed here and everyone is welcome to use them. The list will be updated from time to time. Update, March 31, 2016: a Python version of the segmenter has been added to THULAC.

**Chinese lexical analysis** [THULAC: an efficient Chinese lexical analysis toolkit](//link.zhihu.com/?target=http%3A//thulac.thunlp.org/) Includes Chinese word segmentation and POS tagging. C++, Java, and Python versions are available.

**Chinese text classification** [THUCTC: an efficient Chinese text classification tool](//link.zhihu.com/?target=http%3A//thuctc.thunlp.org/) Provides efficient Chinese text feature extraction, classifier training, and testing.

**THUTag: keyword extraction and social tag suggestion toolkit** [GitHub - YeDeming/THUTag: A Package of Keyphrase Extraction and Social Tag Suggestion](//link.zhihu.com/?target=https%3A//github.com/YeDeming/THUTag/) Provides keyword extraction and social tag suggestion, including TextRank, ExpandRank, Topical PageRank (TPR), Tag-LDA, Word Trigger Model, Word Alignment Model, and other algorithms.

**PLDA / PLDA+: an efficient distributed LDA learning toolkit** [https://code.google.com/archive/p/plda/](//link.zhihu.com/?target=https%3A//code.google.com/archive/p/plda/)

**Knowledge representation learning** Knowledge representation learning toolkit: [GitHub - Mrlyk423/Relation_Extraction: Knowledge Base Embedding](//link.zhihu.com/?target=https%3A//github.com/mrlyk423/relation_extraction) Includes TransE, TransH, TransR, PTransE, and other algorithms. Knowledge representation learning that takes entity descriptions into account: [GitHub - xrb92/DKRL: Representation Learning of Knowledge Graphs with Entity Descriptions](//link.zhihu.com/?target=https%3A//github.com/xrb92/DKRL)

**Word representation learning** Cross-lingual word representation learning: [Learning Cross-lingual Word Embeddings via Matrix Co-factorization](//link.zhihu.com/?target=http%3A//nlp.csai.tsinghua.edu.cn/%7Elzy/src/acl2015_bilingual.html) Topic-enhanced word representation learning: [GitHub - largelymfs/topical_word_embeddings: A demo code for topical word embedding](//link.zhihu.com/?target=https%3A//github.com/largelymfs/topical_word_embeddings) Interpretable word representation learning: [GitHub - SkTim/OIWE: Online Interpretable Word Embeddings](//link.zhihu.com/?target=https%3A//github.com/SkTim/OIWE) Character-aware word representation learning: [GitHub - Leonard-Xu/CWE](//link.zhihu.com/?target=https%3A//github.com/Leonard-Xu/CWE)

**Network representation learning** Text-enhanced network representation learning: [GitHub - albertyang33/TADW: code for IJCAI2015 paper "Network Representation Learning with Rich Text Information"](//link.zhihu.com/?target=https%3A//github.com/albertyang33/TADW)

Edited 2016-03-31

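[孔牧](/people/kong-mu), software engineer

I only know the open-source projects on the text side; I hope this helps.

GATE, a complete text mining pipeline: [http://gate.ac.uk/](//link.zhihu.com/?target=http%3A//gate.ac.uk/) You can add components to it, following its conventions, to complete your own NLP tasks. My project team once tried it. Although it supports component development, it is still not very flexible, so we developed a pipeline of our own.

A Chinese NLP tool, HIT LTP: [http://ir.hit.edu.cn/](//link.zhihu.com/?target=http%3A//ir.hit.edu.cn/) This is a fairly complete pipeline. Quality aside, it provides word segmentation, semantic annotation, dependency parsing, and entity recognition. It does produce wrong results, but I could not find anything better.

ICTCLAS, the Chinese Academy of Sciences segmenter: a fairly authoritative segmenter. I expect you will end up choosing it as the segmenter for your project. It has many problems of its own, but I could not find a better open-source project.

MOSS, the Microsoft segmenter: this one is not open source, but its segmentation is very accurate. Unfortunately it performs segmentation and entity recognition in one step, and the segmenter (in the tool it provides) gives no POS tags.

Syntactic parsing, Stanford Parser: reportedly quite unusable for Chinese, but give it a try.

Those are all finished products. Below are some algorithm toolkits. For CRF, a relatively new sequence labeling algorithm, the open-source project CRF++. For the classic SVM model, svm-light and libsvm.

Posted 2011-11-24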

[肖智博](/people/xiaozhibo), a matter of interest

Surprised that nobody has mentioned [Natural Language Toolkit](//link.zhihu.com/?target=http%3A//nltk.org/)

Posted 2013-07-17

[戴磊](/people/dai-lei), natural language processing, a "good person" in the traditional sense and "single…

**_NiuTrans_** is developed by the [Natural Language Processing Laboratory at Northeastern University](//link.zhihu.com/?target=http%3A//www.nlplab.com/). It supports several statistical machine translation models (phrase-based, hierarchical phrase-based, syntax-based) and embeds a small, efficient _N_-gram language model, so it needs no external software such as SRILM. Download: [NiuTrans download](//link.zhihu.com/?target=http%3A//www.nlplab.com/NiuPlan/NiuTrans.ch.html)

Posted 2016-03-11

[贺一帆](/people/yifan)

For speech there is CMU's Sphinx: [http://cmusphinx.sourceforge.net/](//link.zhihu.com/?target=http%3A//cmusphinx.sourceforge.net/) and Cambridge University's HTK: [HTK Speech Recognition Toolkit](//link.zhihu.com/?target=http%3A//htk.eng.cam.ac.uk/) For Chinese word segmentation there is now also ansj, which originated on CSDN: [ansjsun/ansj_seg · GitHub](//link.zhihu.com/?target=https%3A//github.com/ansjsun/ansj_seg) It is based on ICTCLAS, works well, and has a clean interface.

Posted 2013-06-05

[邱锡鹏](/people/xpqiu), natural language processing

If you want deeper analysis beyond word segmentation, I recommend the open-source FNLP: [GitHub](//link.zhihu.com/?target=https%3A//github.com/xpqiu/fnlp/) Disclosure: I am the FNLP project lead.

Edited 2014-09-05

[李丕绩](/people/pijili), Text Mining, Computer Vision, Machine …

HIT LTP is both comprehensive and easy to use: word segmentation, POS tagging, NER, syntactic parsing, and more.

Posted 2016-03-25

[mallow](/people/hobermallow)

For word segmentation I recommend ansj. [https://github.com/NLPchina/ansj_seg](//link.zhihu.com/?target=https%3A//github.com/NLPchina/ansj_seg) It is handier to use than the tools from the author's teacher, Dr. 张华平. It now also offers keyword extraction and other features, and is very powerful. word2vec is a great thing: once words are turned into vectors, many tasks become easier.

Posted 2014-11-03

[沈阳阳](/people/chen-yang-yang-3), striving for a bright future for applied machine translation

In the first half of this year, the NLP lab at Northeastern University released NiuParser, a set of automatic analysis tools for Chinese. It is worth a look.

Posted 2014-11-05

[武博文](/people/wu-bo-wen-99), NLP/Resys with Java/Python

Let me list a few powerful Python packages. NLTK, mentioned above (surprisingly late in the thread), is a very powerful toolkit, and 3.x wraps many Stanford NLP interfaces. For corpora and related tasks, newspaper, TextBlob and others are high-quality open-source projects. For Chinese, jieba and snownlp are both quite good.

wanghaisheng commented 8 years ago

MOOC

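Stanford Deep Learning for Natural Language Processing, Lecture 1: Introduction
Stanford Deep Learning for Natural Language Processing, Lecture 2: Word vectors
Stanford Deep Learning for Natural Language Processing, Lecture 3: Advanced word vector representations
Stanford Deep Learning for Natural Language Processing, Lecture 4: Word window classification and neural networks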

wanghaisheng commented 8 years ago

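NLP and medicine

http://wi.hit.edu.cn/cemr/ Online demo based on Chinese electronic medical record text
The named entity recognition here, supplemented with recognition of organization names, person names, and so on, and combined with pullword, can assist data extraction.

https://github.com/wanghaisheng/models/tree/master/syntaxnet Planning to use this TensorFlow model from Google to train our word segmentation and POS tagging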

http://www.xunsearch.com/scws/

http://xueshu.baidu.com/s?wd=paperuri%3A%28680725de0f671ad53355c02c8c236392%29&filter=sc_long_sign&tn=SE_xueshusource_2kduw22v&sc_vurl=http%3A%2F%2Fcdmd.cnki.com.cn%2FArticle%2FCDMD-10004-1015611084.htm&ie=utf-8&sc_us=12482090631496785537

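Research on methods for building a large-scale corpus of traditional Chinese medicine clinical records for named entity extraction

http://blog.sciencenet.cn/home.php?mod=space&uid=362400&do=blog&id=1001697 Follow this person's blog to get the overall picture

http://blog.sciencenet.cn/home.php?mod=space&uid=362400&do=blog&id=932462 Information extraction system architecture

Zachary Lipton# RNN/LSTM sequence modeling and its applications in medicine and NLP. (1) Survey, RNN for Sequence Learning: paper (38 pages) + slides (80 pages). (2) Multilabel classification of diagnoses from sensor time series in a pediatric intensive care unit: Learning to Diagnose with LSTM Recurrent Neural Networks [ICLR16]. (3) Blog posts (KDnuggets, IEEE Spectrum) http://t.cn/RqKBanr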

http://zacklipton.com/

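NLP and structured extraction
NLP and keyword extraction

http://t.cn/zQivPUK deep learning in nlp → word vectors and language models. As accessible as it gets. Recommended.

Key NLP techniques; a common NLP processing pipeline, using a bot as the example
http://licstar.net/archives/328

Episode 44: A deep dive into NLP: how Chinese word segmentation affects your everyday life | 硬创公开课 http://mp.weixin.qq.com/s?__biz=MzIzMjIwNzM4OA==&mid=2650042180&idx=1&sn=0a78ab62e41a6ddeabb00729335911c7

Comparison of HIT LTP and CAS NLPIR for Chinese word segmentation

http://blog.csdn.net/churximi/article/details/51174317

cdmd.cnki.com.cn/Article/CDMD-10213-1015980148.htm
Research on named entity recognition in Chinese electronic medical records

http://cdmd.cnki.com.cn/Article/CDMD-10213-1014081807.htm Research on entity relation extraction from electronic medical records

http://www.ltp-cloud.com/intro/
"Language Cloud" is built on the Language Technology Platform (LTP) developed by the HIT Research Center for Social Computing and Information Retrieval, and offers users an efficient and accurate cloud service for Chinese natural language processing. Using it is simple: construct an HTTP request from the API parameters and the analysis result is returned online, with no SDK to download and no high-performance machine to buy. It also supports cross-platform and cross-language programming. In November 2014, HIT and iFLYTEK jointly launched the "HIT-iFLYTEK Language Cloud". Drawing on iFLYTEK's extensive experience with nationwide large-scale cloud services, it markedly improved the stability and throughput of the public service, giving users carrier-grade stability and nationwide network access, and supporting the commercial needs of developers, including small and medium-sized enterprises.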

wanghaisheng commented 8 years ago

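TransE is a very popular knowledge representation learning method. 韩旭 in our group recently optimized the TransE code in the previously open-sourced KB2E (web link), speeding up training nearly 40-fold on CPU: data that used to take more than two hours to train now finishes in 4 minutes. You are welcome to use Fast-TransE: https://github.com/thunlp/Fast-TransE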

clear-datacenter / plan

nlp #7
