Language | version |
---|---|
Python | |
Rust |
LTP(Language Technology Platform) 提供了一系列中文自然语言处理工具,用户可以使用这些工具对于中文文本进行分词、词性标注、句法分析等等工作。
如果您在工作中使用了 LTP,您可以引用这篇论文
@inproceedings{che-etal-2021-n,
title = "N-{LTP}: An Open-source Neural Language Technology Platform for {C}hinese",
author = "Che, Wanxiang and
Feng, Yunlong and
Qin, Libo and
Liu, Ting",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
month = nov,
year = "2021",
address = "Online and Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-demo.6",
doi = "10.18653/v1/2021.emnlp-demo.6",
pages = "42--49",
abstract = "We introduce N-LTP, an open-source neural language technology platform supporting six fundamental Chinese NLP tasks: lexical analysis (Chinese word segmentation, part-of-speech tagging, and named entity recognition), syntactic parsing (dependency parsing), and semantic parsing (semantic dependency parsing and semantic role labeling). Unlike the existing state-of-the-art toolkits, such as Stanza, that adopt an independent model for each task, N-LTP adopts the multi-task framework by using a shared pre-trained model, which has the advantage of capturing the shared knowledge across relevant Chinese tasks. In addition, a knowledge distillation method (Clark et al., 2019) where the single-task model teaches the multi-task model is further introduced to encourage the multi-task model to surpass its single-task teacher. Finally, we provide a collection of easy-to-use APIs and a visualization tool to make users to use and view the processing results more easily and directly. To the best of our knowledge, this is the first toolkit to support six Chinese NLP fundamental tasks. Source code, documentation, and pre-trained models are available at https://github.com/HIT-SCIR/ltp.",
}
参考书: 由哈工大社会计算与信息检索研究中心(HIT-SCIR)的多位学者共同编著的《自然语言处理:基于预训练模型的方法 》(作者:车万翔、郭江、崔一鸣;主审:刘挺)一书现已正式出版,该书重点介绍了新的基于预训练模型的自然语言处理技术,包括基础知识、预训练词向量和预训练模型三大部分,可供广大 LTP 用户学习参考。
# 方法 1: 使用清华源安装 LTP
# 1. 安装 PyTorch 和 Transformers 依赖
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple torch transformers
# 2. 安装 LTP
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple ltp ltp-core ltp-extension
# 方法 2: 先全局换源,再安装 LTP
# 1. 全局换 TUNA 源
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
# 2. 安装 PyTorch 和 Transformers 依赖
pip install torch transformers
# 3. 安装 LTP
pip install ltp ltp-core ltp-extension
注: 如果遇到任何错误,请尝试使用上述命令重新安装 ltp,如果依然报错,请在 Github issues 中反馈。
import torch
from ltp import LTP
# 默认 huggingface 下载,可能需要代理
ltp = LTP("LTP/small") # 默认加载 Small 模型
# 也可以传入模型的路径,ltp = LTP("/path/to/your/model")
# /path/to/your/model 应当存在 config.json 和其他模型文件
# 将模型移动到 GPU 上
if torch.cuda.is_available():
# ltp.cuda()
ltp.to("cuda")
# 自定义词表
ltp.add_word("汤姆去", freq=2)
ltp.add_words(["外套", "外衣"], freq=2)
# 分词 cws、词性 pos、命名实体标注 ner、语义角色标注 srl、依存句法分析 dep、语义依存分析树 sdp、语义依存分析图 sdpg
output = ltp.pipeline(["他叫汤姆去拿外衣。"], tasks=["cws", "pos", "ner", "srl", "dep", "sdp", "sdpg"])
# 使用字典格式作为返回结果
print(output.cws) # print(output[0]) / print(output['cws']) # 也可以使用下标访问
print(output.pos)
print(output.sdp)
# 使用感知机算法实现的分词、词性和命名实体识别,速度比较快,但是精度略低
ltp = LTP("LTP/legacy")
# cws, pos, ner = ltp.pipeline(["他叫汤姆去拿外衣。"], tasks=["cws", "ner"]).to_tuple() # error: NER 需要 词性标注任务的结果
cws, pos, ner = ltp.pipeline(["他叫汤姆去拿外衣。"], tasks=["cws", "pos", "ner"]).to_tuple() # to tuple 可以自动转换为元组格式
# 使用元组格式作为返回结果
print(cws, pos, ner)
use std::fs::File;
use itertools::multizip;
use ltp::{CWSModel, POSModel, NERModel, ModelSerde, Format, Codec};
fn main() -> Result<(), Box<dyn std::error::Error>> {
let file = File::open("data/legacy-models/cws_model.bin")?;
let cws: CWSModel = ModelSerde::load(file, Format::AVRO(Codec::Deflate))?;
let file = File::open("data/legacy-models/pos_model.bin")?;
let pos: POSModel = ModelSerde::load(file, Format::AVRO(Codec::Deflate))?;
let file = File::open("data/legacy-models/ner_model.bin")?;
let ner: NERModel = ModelSerde::load(file, Format::AVRO(Codec::Deflate))?;
let words = cws.predict("他叫汤姆去拿外衣。")?;
let pos = pos.predict(&words)?;
let ner = ner.predict((&words, &pos))?;
for (w, p, n) in multizip((words, pos, ner)) {
println!("{}/{}/{}", w, p, n);
}
Ok(())
}
深度学习模型(🤗HF/🗜 压缩包) | 分词 | 词性 | 命名实体 | 语义角色 | 依存句法 | 语义依存 | 速度(句/S) |
---|---|---|---|---|---|---|---|
🤗Base 🗜Base | 98.7 | 98.5 | 95.4 | 80.6 | 89.5 | 75.2 | 39.12 |
🤗Base1 🗜Base1 | 99.22 | 98.73 | 96.39 | 79.28 | 89.57 | 76.57 | --.-- |
🤗Base2 🗜Base2 | 99.18 | 98.69 | 95.97 | 79.49 | 90.19 | 76.62 | --.-- |
🤗Small 🗜Small | 98.4 | 98.2 | 94.3 | 78.4 | 88.3 | 74.7 | 43.13 |
🤗Tiny 🗜Tiny | 96.8 | 97.1 | 91.6 | 70.9 | 83.8 | 70.1 | 53.22 |
感知机算法模型(🤗HF/🗜 压缩包) | 分词 | 词性 | 命名实体 | 速度(句/s) | 备注 |
---|---|---|---|---|---|
🤗Legacy 🗜Legacy | 97.93 | 98.41 | 94.28 | 21581.48 | 性能详情 |
注:感知机算法速度为开启 16 线程速度
# 使用 HTTP 链接下载
# 确保已安装 git-lfs (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/LTP/base
# 使用 ssh 下载
# 确保已安装 git-lfs (https://git-lfs.com)
git lfs install
git clone git@hf.co:LTP/base
# 下载压缩包
wget http://39.96.43.154/ltp/v4/base.tgz
tar -zxvf base.tgz -C base
from ltp import LTP
# 在路径中给出模型下载或解压后的路径
# 例如:base 模型的文件夹路径为 "path/to/base"
# "path/to/base" 下应当存在 "config.json"
ltp = LTP("path/to/base")
make bdist
感知机算法
深度学习算法