boudinfl / pke

Python Keyphrase Extraction module
GNU General Public License v3.0

Does not support Chinese? #220

Closed wf4867612 closed 1 year ago

wf4867612 commented 1 year ago

Does pke not support Chinese?

tagucci commented 1 year ago

@wf4867612 You can extract keyphrases from Chinese text as follows.

Before running the code, install pke and the Chinese spaCy model.

$ pip install git+https://github.com/boudinfl/pke.git
$ python -m spacy download zh_core_web_sm

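The code below is pke's minimal example with small modifications: the text is parsed with the Chinese spaCy model, and the resulting spaCy document is passed to load_document.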

import spacy
import pke

nlp = spacy.load("zh_core_web_sm")

text = """自然语言处理( Natural Language Processing, NLP)是计算机科学领域与人工智能领域中的一个重要方向。
它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。
自然语言处理是一门融语言学、计算机科学、数学于一体的科学。
因此,这一领域的研究将涉及自然语言,即人们日常使用的语言,所以它与语言学的研究有着密切的联系,但又有重要的区别。
自然语言处理并不是一般地研究自然语言,而在于研制能有效地实现自然语言通信的计算机系统,特别是其中的软件系统。
因而它是计算机科学的一部分。"""
doc = nlp(text)

extractor = pke.unsupervised.TopicRank()
extractor.load_document(input=doc, normalization=None)

# keyphrase candidate selection, in the case of TopicRank: sequences of nouns
# and adjectives (i.e. `(Noun|Adj)*`)
extractor.candidate_selection()

# candidate weighting, in the case of TopicRank: using a random walk algorithm
extractor.candidate_weighting()

# N-best selection: `candidates` contains the 5 highest-scored candidates as
# (keyphrase, score) tuples
candidates = extractor.get_n_best(n=5)
for candidate in candidates:
    keyphrase, score = candidate
    print(keyphrase)
"""
自然 语言
计算机 科学 领域
有效 通信
语言学
各种 理论
"""
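
The keyphrases in the output above are printed with a space between tokens (for example 自然 语言), which is usually not how Chinese is displayed. If that matters, the spaces can be removed afterwards. The following is only a minimal sketch: `join_chinese_tokens` is a hypothetical helper and not part of pke, and it assumes that the basic CJK Unified Ideographs range (U+4E00 to U+9FFF) covers the characters in your text. Spaces between non-Chinese words are left untouched.

```python
import re

# Assumption: the basic CJK Unified Ideographs block is enough for the text at hand.
_CJK = r"\u4e00-\u9fff"

# Whitespace that has a Chinese character on both sides.
_SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{_CJK}])\s+(?=[{_CJK}])")


def join_chinese_tokens(keyphrase: str) -> str:
    """Remove whitespace that separates two Chinese characters.

    Hypothetical post-processing helper, not part of pke.
    """
    return _SPACE_BETWEEN_CJK.sub("", keyphrase)


if __name__ == "__main__":
    # Keyphrases copied from the output above, plus an English phrase
    # to show that ordinary word spacing is preserved.
    for phrase in ["自然 语言", "计算机 科学 领域", "语言学", "machine learning"]:
        print(join_chinese_tokens(phrase))
```

In the example above it could be applied inside the loop, e.g. `print(join_chinese_tokens(keyphrase))`.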