boudinfl / pke

Python Keyphrase Extraction module
GNU General Public License v3.0

How to use my own stopwords? #161

Closed: hjing100 closed this issue 3 years ago

hjing100 commented 3 years ago
import pke
import spacy

nlp = spacy.load('zh_core_web_sm')

# initialize keyphrase extraction model, here YAKE (swapped in for TopicRank)
extractor = pke.unsupervised.YAKE()

# load the content of the document, here document is expected to be in raw
# format (i.e. a simple text file) and preprocessing is carried out using spacy
extractor.load_document(input=inputfile, language='zh', spacy_model=nlp)

if my_stoplist is not None:
    # use the custom stopword list
    extractor.candidate_filtering(stoplist=my_stoplist)

# keyphrase candidate selection, in the case of YAKE: word n-grams
# (up to 3 words by default)
extractor.candidate_selection()

# candidate weighting, in the case of YAKE: using statistical features of the words
extractor.candidate_weighting()

# N-best selection, keyphrases contains the 3 highest scored candidates as
# (keyphrase, score) tuples
keyphrases = extractor.get_n_best(n=3)

WARNING:root:No stopwords for 'zh' language.
WARNING:root:Please provide custom stoplist if willing to use stopwords. Or update nltk's stopwotk.download('stopwords')
WARNING:root:No stemmer for 'zh' language.
WARNING:root:Stemming will not be applied.

ygorg commented 3 years ago

Hi, thanks for using pke.

Chinese is not supported in nltk, which is why you get these warnings.

You can pass your custom stoplist to candidate_selection, candidate_filtering (if you use it) and candidate_weighting (if the method needs stopwords). Skipping stopword-based candidate filtering can alter the keyphrase scores for methods that compute scores globally (graph-based methods, or Tf-Idf if the Tf-Idf matrix is computed with stopwords included). It won't change anything for EmbedRank, for example, because each candidate's score does not depend on the other candidates.
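For reference, here is a minimal sketch of what that could look like with YAKE. The helper names `load_stoplist` and `extract_keyphrases` and the one-word-per-line file format are assumptions made for illustration; they are not part of pke. The sketch also assumes a pke 1.x release in which `candidate_selection` and `candidate_weighting` accept a `stoplist` keyword, as described above, so check the signatures of your installed version.

```python
from pathlib import Path


def load_stoplist(path):
    """Read one stopword per line, skipping blank lines and '#' comments.

    Hypothetical helper: the file format is an assumption, not a pke convention.
    """
    words = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        word = line.strip()
        if word and not word.startswith("#"):
            words.append(word)
    return words


def extract_keyphrases(inputfile, stoplist, spacy_model, n=3):
    """Run YAKE, passing the same stoplist to each step that takes one.

    Assumes pke 1.x keyword arguments; newer releases may differ.
    """
    import pke  # imported here so load_stoplist can be used without pke

    extractor = pke.unsupervised.YAKE()
    extractor.load_document(input=inputfile, language="zh",
                            spacy_model=spacy_model)
    # candidates are selected first, with stopwords taken from the custom list
    extractor.candidate_selection(stoplist=stoplist)
    # YAKE's weighting also uses stopwords, so it gets the same list
    extractor.candidate_weighting(stoplist=stoplist)
    return extractor.get_n_best(n=n)
```

If you also call candidate_filtering, it most likely belongs after candidate_selection, since it filters candidates that have already been selected. Calling it first, as in the snippet in the question, would probably have no effect.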