clear-datacenter / plan

MIT License

nlp #7

Open wanghaisheng opened 8 years ago

wanghaisheng commented 8 years ago

Stanford University CS224d course: http://blog.csdn.net/han_xiaoyang/article/details/51567822

wanghaisheng commented 8 years ago

http://torch.ch/blog/2016/07/25/nce.html (language models)

wanghaisheng commented 8 years ago

https://mp.weixin.qq.com/s?__biz=MzA4OTk5OTQzMg==&mid=2449231335&idx=1&sn=d3ba98841e85b7cea0049cc43b3c16ca An attempt at Chinese word segmentation based on deep learning

wanghaisheng commented 8 years ago

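Dataset: HFL-RC: A Chinese Reading Comprehension Dataset http://hfl.iflytek.com/chinese-rc/ http://arxiv.org/abs/1607.02250 [The HIT-iFLYTEK Joint Laboratory releases Chinese reading comprehension datasets]

On July 18, 2016, the HIT-iFLYTEK Joint Laboratory (HFL) released cloze-style Chinese reading comprehension datasets, comprising a People's Daily news dataset and a "children's reading" dataset (HFL-RC: People Daily and CFT dataset).

For English reading comprehension there are already the Google DeepMind CNN/Daily Mail datasets and the Facebook CBTest dataset, but a Chinese reading comprehension dataset has been missing. The datasets released by HFL fill that gap. Unlike the two English datasets above, the "children's reading" dataset also contains human-written questions, which are harder to answer than automatically constructed ones and pose a new challenge for reading comprehension research.

In addition, we designed a simple and effective neural network for cloze-style reading comprehension and obtained good results. The HIT-iFLYTEK Joint Laboratory (HFL) is focusing on "reading comprehension", an artificial intelligence problem that has drawn wide attention in the field, and will release related results over time.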

wanghaisheng commented 8 years ago

topwords experiment
Logged in as user wanghs; the current directory is /home/wanghs/projects/
Download the source code:
git clone https://github.com/qf6101/topwords

Upload the test data to HDFS. First create the target directory:

[wanghs@node3 test_data]$ sudo -u hdfs hadoop dfs -mkdir -p /data/topwords/test_data

If the permissions are not changed, the following error appears:

[wanghs@node3 test_data]$ sudo -u hdfs hadoop fs -copyFromLocal /home/wanghs/projects/topwords/test_data/story_of_stone.txt /data/topwords/test_data
copyFromLocal: `/home/wanghs/projects/topwords/test_data/story_of_stone.txt': No such file or directory

Change the ownership of the directory as follows:

[wanghs@node3 test_data]$ sudo -u hdfs hadoop fs -chown -R wanghs:wanghs /data

Copy the local file to HDFS:

[wanghs@node3 test_data]$ hadoop fs -copyFromLocal story_of_stone.txt /data/topwords/test_data

[wanghs@node3 test_data]$ hadoop fs -ls /data/topwords/test_data
Found 1 items
-rw-r--r-- 3 wanghs wanghs 2633092 2016-08-16 18:56 /data/topwords/test_data/story_of_stone.txt

Create the directory for the results in the same way:

[wanghs@node3 test_data]$ sudo -u hdfs hadoop fs -mkdir -p /data/topwords/output

Edit the configuration file:

#!/usr/bin/env bash

# get into the current directory
cd "$( cd "$( dirname "$0"  )" && pwd  )"

##### The Parameters You Need to Predefine Start #####

# set the environment variables
HADOOP_HOME="/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop"
SPARK_HOME="/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark"
topwords_jar="../release/topwords-1.0.jar"  #topwords jar file

# set the arguments
inputLoc="/data/topwords/test_data/*"  #location of input corpus in HDFS
outputLoc="/data/topwords/output" #location of output dictionary and segmented corpus in HDFS
tauL="6"  #threshold of word length
tauF="100"  #threshold of word frequency
numIterations="5"  #number of iterations
convergeTol="1E-3"  #convergence tolerance
textLenThld="2000"  #preprocessing threshold of text length
useProbThld="1E-8"  #prune threshold of word use probability
wordBoundaryThld="0.0"  #segment threshold of word boundary score (use segment tree if set to <= 0)
numPartitions="5000"  #number of partitions
executor_memory="5G"  #memory allocation for each executor
num_executors="40"  #number of executors allocated
executor_cores="1"  #number of cores allocated for each executor
queue="queue_name"  #yarn queue

##### The Parameters You Need to Predefine End #####

# execute the TopWORDS algorithm

function_exec(){
${SPARK_HOME}/bin/spark-submit \
--class io.github.qf6101.topwords.TopWORDSApp \
--master yarn \
--deploy-mode cluster \
--name topwords \
--executor-memory $executor_memory \
--num-executors $num_executors \
--executor-cores $executor_cores \
--queue $queue \
${topwords_jar} \
--inputLoc $inputLoc \
--outputLoc $outputLoc \
--tauL $tauL \
--tauF $tauF \
--numIterations $numIterations \
--convergeTol $convergeTol \
--textLenThld $textLenThld \
--useProbThld $useProbThld \
--wordBoundaryThld $wordBoundaryThld \
--numPartitions $numPartitions
}

output="topwords.running"
function_exec > ${output} 2>&1

sleep 60s
app_id=`grep -Eo "application_[0-9]+_[0-9]+" ${output} | head -n1`
logfile=${output}.log
${HADOOP_HOME}/bin/yarn logs -applicationId ${app_id} > ${logfile}
#check execution result
if grep -i ERROR ${logfile}
then
        rm $output
        echo "Exception occurred in $0. See `readlink -f ${logfile}`"
        exit 1
else
        rm $output
        echo "Finish $0."
        exit 0
fi

Edit /etc/profile:


export HADOOP_HOME=/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop
export HIVE_HOME=/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hive
export HBASE_HOME=/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hbase
export HADOOP_HDFS_HOME=/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop-hdfs
export HADOOP_MAPRED_HOME=/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop-mapreduce
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_LIBEXEC_DIR=${HADOOP_HOME}/libexec
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HDFS_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_YARN_HOME=/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop-yarn
export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop

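The run failed with the error below. I could not resolve it and have shelved the attempt. A likely cause, judging from the trace: org.apache.spark.sql.SparkSession was only introduced in Spark 2.0, while CDH 5.7.0 bundles Spark 1.6, so the topwords jar was presumably built against a newer Spark than the one installed on this cluster.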


[wanghs@node3 topwords]$ bash deploy/sbin/topwords_local.sh
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/SparkSession$
        at io.github.qf6101.topwords.TopWORDSApp$.main(TopWORDSApp.scala:15)
        at io.github.qf6101.topwords.TopWORDSApp.main(TopWORDSApp.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.SparkSession$
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        ... 11 more
Running TopWORDS fail at Wed Aug 17 13:32:53 CST 2016

Try again using an image from a registry hosted in China: https://hub.tenxcloud.com/repos/sdvdxl/spark-native-yarn

wanghaisheng commented 8 years ago

http://nbviewer.jupyter.org/github/jayantj/gensim/blob/683720515165a332baed8a2a46b6711cefd2d739/docs/notebooks/Word2Vec_FastText_Comparison.ipynb
Run an experiment with fastText on the parsed XPS reports.

wanghaisheng commented 8 years ago

What industries are next to be disrupted by NLP and Text Analysis? http://blog.aylien.com/nlp-text-analysis-insurance-legal-customer-service/

wanghaisheng commented 8 years ago

Part-of-speech tagging: https://arxiv.org/pdf/1609.07053v1.pdf

To address the first question, we will look at convolutional neural networks (CNNs) and recurrent neural networks (RNNs), which are both highly prominent approaches in recent natural language processing (NLP) literature. A recent development is the emergence of deep residual networks (ResNets), a building block for CNNs. ResNets consist of several stacked residual units, which can be thought of as a collection of convolutional layers coupled with a 'shortcut' which aids the propagation of the signal in a neural network. This allows for the construction of much deeper networks, since keeping a 'clean' information path in the network facilitates optimisation (He et al., 2016). ResNets have recently shown state-of-the-art performance for image classification tasks (He et al., 2015; He et al., 2016), and have

wanghaisheng commented 8 years ago

Welcome to the awesome-nlp wiki!

Weibo noun list: https://github.com/StevenLOL/chinese-weibo-noun/tree/8c802b4f6139cecad96e907a0632244cbc17dab8

Word2Vec experiments on Chinese and English Wikipedia corpora: http://www.52nlp.cn/%e4%b8%ad%e8%8b%b1%e6%96%87%e7%bb%b4%e5%9f%ba%e7%99%be%e7%a7%91%e8%af%ad%e6%96%99%e4%b8%8a%e7%9a%84word2vec%e5%ae%9e%e9%aa%8c

http://licstar.net/archives/262 Obtaining a Simplified Chinese Wikipedia corpus
https://github.com/dsindex/syntaxnet
https://github.com/tensorflow/models/tree/master/syntaxnet

Comparison of FastText and Word2Vec

TopWORDS is a method recently published in PNAS. Without any prior knowledge, it quickly learns a ranked dictionary, together with the segmentation of the corpus text, from a large-scale Chinese corpus. http://qf6101.github.io/machine%20learning/2016/07/01/TopWORDS
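As a rough illustration of where a dictionary can come from without prior knowledge, the sketch below enumerates every substring up to a maximum length and keeps the frequent ones. The parameter names mirror the tauL (word length) and tauF (word frequency) thresholds in the run script earlier in this thread. This covers only a candidate-generation step under my own assumptions: the function name is hypothetical, keeping all single characters is an assumption, and the statistical estimation and pruning that TopWORDS performs afterwards are not shown.

```python
from collections import Counter


def candidate_words(texts, tau_l=6, tau_f=100):
    """Count every substring of length <= tau_l and keep the frequent ones."""
    counts = Counter()
    for text in texts:
        n = len(text)
        for start in range(n):
            # Longest substring starting here is capped by tau_l and by the text end
            for length in range(1, min(tau_l, n - start) + 1):
                counts[text[start:start + length]] += 1
    # Assumption: single characters are always kept so any text stays segmentable
    return {w: c for w, c in counts.items() if len(w) == 1 or c >= tau_f}


# Toy corpus with tiny thresholds, just to show the shape of the result
corpus = ["abcabcab", "bcabca"]
vocab = candidate_words(corpus, tau_l=3, tau_f=3)
```

With the thresholds used in the script (tauL=6, tauF=100) the candidate set for a real corpus would be much larger, which is why a later pruning step matters.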

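https://mp.weixin.qq.com/s?__biz=MzAxMzA2MDYxMw==&mid=2651555619&idx=1&sn=4cdc0e19cf259845825f6a95707e1105 [In-depth] Deng Ke (邓柯): Unsupervised Chinese text analysis based on statistical models

In a recent deep learning paper from Google, Oriol Vinyals, Geoff Hinton, and colleagues applied LSTMs to the parsing problem in NLP and obtained good results. http://arxiv.org/pdf/1412.7449v1.pdf

Slides of our NAACL-16 tutorial, Recent Progress on DL 4 NLP, and slides of my talk at the QA workshop, Towards Neural-Net-based QA.

"Deep Learning For Natural Language Processing", given last Sunday at City University of Hong Kong, Seminar on DL4NLP: http://lt.cityu.edu.hk/Research/cel/ http://nlp.fudan.edu.cn/xpqiu/slides/20160618_DL4NLP@CityU.pdf

The Symposium on Frontier Technologies in Natural Language Processing, held together with the Tsinghua University "Computing the Future" Master's and Doctoral Forum, concluded successfully:

http://www.cipsc.org.cn/qngw/?p=800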

Language Understanding for Text-based Games using Deep Reinforcement Learning #PaperWeekly# http://rsarxiv.github.io/2016/06/27/Language-Understanding-for-Text-based-Games-using-Deep-Reinforcement-Learning-PaperWeekly/

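[The HIT-iFLYTEK Joint Laboratory makes progress on zero anaphora resolution]

For zero pronoun resolution in natural language understanding, manually annotated training data is very limited, so deep learning cannot show its strength. Researchers at the HIT-iFLYTEK Joint Laboratory proposed a way to construct large-scale "pseudo training data" automatically: if a noun appears twice in a text, the later occurrence is turned into an empty slot (a zero pronoun), which yields a "zero pronoun resolution" instance whose antecedent (the answer that should fill the slot) is the noun itself. In this way an unlimited amount of pseudo training data can be built. Its characteristics do not fully match those of real data, but the volume is huge, so it can be used for pre-training and then combined with the small amount of real training data. Within a unified deep learning framework, this quickly yielded a 5 percentage point improvement over the best existing result on Chinese zero anaphora resolution. The method is simple and clean, easy to port across domains, and still has considerable room for improvement.

The paper has been posted on arXiv. Authors: Liu Ting (刘挺), Cui Yiming (崔一鸣), Yin Qingyu (尹庆宇), Wang Shijin (王士进), Zhang Weinan (张伟男), Hu Guoping (胡国平). Comments and corrections from colleagues are welcome.

Example of the zero anaphora problem: in "小明去找他妈妈了,【】一直没回来" ("Xiao Ming went to look for his mother; [ ] has still not come back"), the one who has not come back should be Xiao Ming, not the mother.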
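The pseudo-data construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the input is assumed to be already segmented into tokens, the set of nouns is assumed to come from a part-of-speech tagger, every repeated occurrence yields its own instance, and the function name and the blank marker are hypothetical.

```python
def make_pseudo_examples(tokens, nouns, blank="<blank>"):
    """Blank each later occurrence of a noun that was already seen.

    Returns a list of (context_tokens, answer) pairs, one per blanked slot.
    """
    seen = set()
    examples = []
    for i, tok in enumerate(tokens):
        if tok in nouns:
            if tok in seen:
                # Replace only this occurrence; the earlier mention stays as the antecedent
                context = tokens[:i] + [blank] + tokens[i + 1:]
                examples.append((context, tok))
            seen.add(tok)
    return examples


# Hypothetical pre-segmented sentence and noun set
tokens = ["小明", "去", "找", "妈妈", "，", "小明", "一直", "没", "回来"]
nouns = {"小明", "妈妈"}
examples = make_pseudo_examples(tokens, nouns)
```

Here the second "小明" becomes the slot and "小明" is the answer, which matches the shape of the zero anaphora example given above.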

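What are the current research hotspots and hard problems in natural language processing, in China and abroad? https://www.zhihu.com/question/30305058

Facebook fastText: https://github.com/facebookresearch/fastText https://github.com/kemaswill/fasttext_torch

I think the most interesting part of this work is that it can find the most informative sentences for an entity, which are often the entity's definition or description. When building a knowledge graph, we can then automatically construct a text description for each newly added entity. [smile]

#cs.CL daily# Knowledge representation currently faces two challenges: (1) how to make better use of an entity's context; (2) how to find the sentences related to an entity. To address both, this paper proposes a model that learns representations from multiple sentences. Knowledge Representation via Joint Learning of Sequential Text and Knowledge Graphs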

http://arxiv.org/abs/1609.07075

wanghaisheng commented 8 years ago

blogs http://colah.github.io/
http://t.cn/Rq8UhU6

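[In-depth] The next field deep learning is about to conquer: NLP. A reading of outstanding ACL 2016 papers (part 1) http://mp.weixin.qq.com/s?__biz=MzIzOTU0NTQ0MA==&mid=2247483864&idx=1&sn=75136bfb9afc4e4f3ed1d3697151aef3
[Deep Learning in NLP: word vectors and language models](http://licstar.net/archives/328)

A guide to "How to Generate a Good Word Embedding?"
http://www.hankcs.com/nlp/

[Liwei's popular science: a brief account of natural language system architecture] http://blog.sciencenet.cn/home.php?mod=space&uid=362400&do=blog&quickforward=1&id=981742

"Deep Learning and Natural Language Processing (Stanford cs224d)" by Han Xiaoyang (寒小阳) and Long Xinchen (龙心尘). Lecture 1: http://blog.csdn.net/longxinchen_ml/article/details/51567960 Deep Learning and Natural Language Processing (7), Stanford cs224d: language models, RNN, LSTM, and GRU http://blog.csdn.net/longxinchen_ml/article/details/51940065

(Slides) "Using Text Embeddings for Information Retrieval" by Bhaskar Mitra

Natural language processing technology in the wave of deep learning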

http://t.cn/R5GX0Jc http://pan.baidu.com/s/1kVlN3YB

wanghaisheng commented 8 years ago

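PhD theses: a fellow student recommended a list of doctoral dissertations three years ago
http://www.weibo.com/1657470871/zqgjp9sMK?type=comment#_rnd1462289209093

In fact, reading through the recent doctoral dissertations from just four groups (Stanford NLP, Berkeley NLP, CMU LTI, and JHU CLSP) gives you more than half of the overall picture of the field.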

Berkeley

Cornell

Cambridge

MIT

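Eisenstein, J. 2008. Gesture in Automatic Discourse Processing

Eisenstein is currently an assistant professor in the computing school at Gatech. He received his PhD at MIT under Barzilay. The thesis is distinctive: it studies gesture, which sets it well apart from ordinary NLP theses.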

Edinburgh

Brown

U.Penn

[PhD thesis: "Research on Neural-Network-Based Semantic Vector Representations of Words and Documents"](http://licstar.net/archives/687)
http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/

wanghaisheng commented 8 years ago

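Libraries: word segmentation
Evaluation: http://www.scholat.com/vpost.html?pid=4477
Benchmark of commonly used word segmentation components: https://github.com/ysc/cws_evaluation
A brief discussion of Chinese word segmentation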
https://github.com/fxsjy/jieba/
https://github.com/NLPchina/ansj_seg
https://github.com/huaban/jieba-analysis
https://github.com/hankcs/HanLP

https://github.com/FudanNLP/fnlp
http://ictclas.nlpir.org/nlpir/

https://github.com/nltk/nltk NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets and tutorials supporting research and development in Natural Language Processing.

https://github.com/thunlp
https://github.com/NLPchina

https://github.com/WILAB-HIT/News/tree/master/2015/10/30
http://wi.hit.edu.cn/cemr/
https://github.com/woshialex/diagnose-heart
https://github.com/ysc/cws_evaluation
'RNNLM Toolkit - Recurrent Neural Network Language Modeling (RNNLM) Toolkit' by Intel Labs (GitHub)

https://github.com/NLPchina/ansj_seg Part-of-speech tagging
https://www.zhihu.com/question/19929473

https://github.com/thunlp

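It currently includes the knowledge representation learning toolkit KB2E, the keyword extraction and tag recommendation toolkit THUTag, the Chinese lexical analysis toolkit THULAC, the Chinese text classification toolkit THUCTC, and others. You are welcome to try them and send us your feedback. [smile]

https://github.com/lionsoul2014/jcseg Jcseg is a lightweight open-source Chinese word segmenter based on the mmseg algorithm. It also integrates keyword extraction, key phrase extraction, key sentence extraction, and automatic article summarization, and provides tokenizer interfaces for the latest versions of Lucene, Solr, and Elasticsearch. http://git.oschina.net/lionsoul/jcseg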
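For readers unfamiliar with dictionary-based segmentation, here is a minimal sketch of plain forward maximum matching, the greedy baseline that MMSEG-style segmenters refine. It is not Jcseg's code or API: the function name, the toy dictionary, and the max_len default are all assumptions made for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: always take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character is the fallback
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in dictionary:
                words.append(piece)
                i += length
                break
    return words


# Hypothetical toy dictionary
toy_dict = {"研究", "研究生", "生命", "起源"}
segmented = forward_max_match("研究生命起源", toy_dict)
```

On this input the greedy choice takes "研究生" first and leaves "命" stranded, although "研究 / 生命 / 起源" is the intended reading. Ambiguities of this kind are what the additional disambiguation rules in MMSEG are meant to reduce.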

Which open-source projects and development kits are commonly used for natural language processing today?

Please cover both text and speech.
wanghaisheng commented 8 years ago

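NLP and medicine

http://wi.hit.edu.cn/cemr/ Online demo based on Chinese electronic medical record text
The named entity recognition here, supplemented by recognition of institution names, person names, and so on, and combined with pullword, can assist data extraction.

https://github.com/wanghaisheng/models/tree/master/syntaxnet I plan to use this TensorFlow model from Google to train our word segmentation and part-of-speech tagging.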

http://www.xunsearch.com/scws/

http://xueshu.baidu.com/s?wd=paperuri%3A%28680725de0f671ad53355c02c8c236392%29&filter=sc_long_sign&tn=SE_xueshusource_2kduw22v&sc_vurl=http%3A%2F%2Fcdmd.cnki.com.cn%2FArticle%2FCDMD-10004-1015611084.htm&ie=utf-8&sc_us=12482090631496785537

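Research on methods for building a large-scale corpus of traditional Chinese medicine clinical records for named entity extraction

http://blog.sciencenet.cn/home.php?mod=space&uid=362400&do=blog&id=1001697 Follow this person's blog to get the overall picture

http://blog.sciencenet.cn/home.php?mod=space&uid=362400&do=blog&id=932462 Information extraction system architecture

#Zachary Lipton# RNN/LSTM sequence modeling and its applications in medicine and NLP. (1) Survey, RNN for Sequence Learning: paper (38 pages) plus slides (80 pages). (2) Multilabel classification of diagnoses from sensor time series collected in a pediatric intensive care unit: Learning to Diagnose with LSTM Recurrent Neural Networks [ICLR16]. (3) Blog posts (KDnuggets, IEEE Spectrum). http://t.cn/RqKBanr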

http://zacklipton.com/

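NLP and structured extraction
NLP and keyword extraction

http://t.cn/zQivPUK Deep learning in NLP → word vectors and language models. About as accessible as it gets. Recommended.

Key NLP technologies; a typical NLP processing pipeline, taking a bot as the example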
http://licstar.net/archives/328

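Issue 44: A closer look at NLP: how Chinese word segmentation affects the details of your daily life | 硬创公开课 http://mp.weixin.qq.com/s?__biz=MzIzMjIwNzM4OA==&mid=2650042180&idx=1&sn=0a78ab62e41a6ddeabb00729335911c7

Comparison of Chinese word segmentation in HIT LTP and CAS NLPIR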

http://blog.csdn.net/churximi/article/details/51174317

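http://cdmd.cnki.com.cn/Article/CDMD-10213-1015980148.htm
Research on named entity recognition in Chinese electronic medical records

http://cdmd.cnki.com.cn/Article/CDMD-10213-1014081807.htm Research on entity relation extraction from electronic medical records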

http://www.ltp-cloud.com/intro/
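"Language Cloud" is built on the "Language Technology Platform (LTP)" developed by the HIT Research Center for Social Computing and Information Retrieval, and offers users efficient and accurate cloud services for Chinese natural language processing. Using Language Cloud is simple: construct an HTTP request from the API parameters and the analysis results come back online, with no SDK to download and no high-performance machine to buy. It also supports cross-platform and cross-language programming. In November 2014, HIT and iFLYTEK jointly launched the "HIT-iFLYTEK Language Cloud". Drawing on iFLYTEK's extensive experience with nationwide large-scale cloud services, it significantly improved the stability and throughput of the public service, offers carrier-grade stability with network access from across the country, and supports the commercial application needs of developers, including small and medium-sized enterprises.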
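"Construct an HTTP request from the API parameters" usually amounts to URL-encoding a handful of key/value pairs. The sketch below shows only that step, using the Python standard library. The endpoint is a placeholder, and the parameter names (api_key, text, pattern, format) and their values are assumptions for illustration; the real names should be taken from the Language Cloud API documentation.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_request_url(base_url, api_key, text, pattern="ws", fmt="json"):
    """Assemble a GET URL; non-ASCII text is percent-encoded as UTF-8."""
    params = {
        "api_key": api_key,  # assumed parameter name
        "text": text,        # assumed parameter name
        "pattern": pattern,  # assumed: which analysis to run
        "format": fmt,       # assumed: response format
    }
    return base_url + "?" + urlencode(params)


# Placeholder endpoint and key, not the real service address
url = build_request_url(
    "http://api.example.invalid/analysis/",
    "YOUR_API_KEY",
    "我爱自然语言处理",
)

# Round trip: decoding the query string recovers the original text
query = parse_qs(urlsplit(url).query)
```

Sending the request is left out on purpose, since it depends on the actual endpoint and on a valid key.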

wanghaisheng commented 8 years ago

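TransE is currently a very popular knowledge representation learning method. Han Xu (韩旭), a student in our group, recently optimized the TransE code in our previously open-sourced KB2E, speeding up training by nearly 40x in a CPU environment: data that used to take more than 2 hours to train now finishes in just 4 minutes. You are welcome to use Fast-TransE: https://github.com/thunlp/Fast-TransE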
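For context, TransE models a triple (head, relation, tail) by asking that head + relation land close to tail in the embedding space, and trains with a margin-based ranking loss against corrupted triples. The sketch below shows only that scoring idea in plain Python. It is not the KB2E or Fast-TransE code; the vectors are made-up toy values and the function names are hypothetical.

```python
import math


def transe_score(head, relation, tail):
    """L2 distance ||h + r - t||; a lower score means a more plausible triple."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def margin_loss(pos_score, neg_score, margin=1.0):
    """Hinge loss: zero once the corrupted triple scores worse by at least the margin."""
    return max(0.0, margin + pos_score - neg_score)


# Toy 2-dimensional embeddings (illustrative values only)
h = [0.1, 0.2]
r = [0.3, 0.1]
t_good = [0.4, 0.3]   # close to h + r
t_bad = [-0.5, 0.9]   # far from h + r

pos = transe_score(h, r, t_good)
neg = transe_score(h, r, t_bad)
loss = margin_loss(pos, neg)
```

In training, the loss is summed over many such pairs and minimized by gradient descent, which is presumably the loop whose speed the optimization above targets.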