boudinfl / pke

Python Keyphrase Extraction module
GNU General Public License v3.0

SnowballStemmer and Spacy-model use different langcodes #212

Closed ghost closed 1 year ago

ghost commented 1 year ago

Hi, I recently started using this package and have come across an issue. In my code below you can see that I am using the basic structure, with the addition of language detection (since my documents can be German or English) and 'stemming' enabled in the extractor.load_document() function.

The problem is that when a German document is detected (detect(filecontent) returns 'de'), 'de' is passed into load_document() and a stemming error occurs, since the SnowballStemmer does not recognize 'de' as German (it uses 'ge' for German). Even setting normalization to 'none' did not solve this. So to fix it, I tried changing 'de' to 'ge' before passing it as an argument into load_document(), but that causes a different error, since "there is no spacy-model for 'ge' language".

Ultimately my solution was to go into the lang.py file of the pke package and change the langcode for German from "ge": "german" to "de": "german". With this change I was able to pass 'de' into load_document() with stemming enabled and no further issues.

I hope that I described the issue clearly. If I made a programming mistake, please let me know. Also, thank you for providing this package, it is very useful :)

```python
import pke
from langdetect import detect

# scan for language; filecontent = content of a .txt-file that I want to
# extract keywords from
lang = detect(filecontent)

# initialize keyphrase extraction model, here TopicRank
extractor = pke.unsupervised.TopicRank()

# load the content of the document, here document is expected to be a simple
# text string and preprocessing is carried out using spacy
extractor.load_document(input=filecontent, language=lang, normalization='stemming')

# keyphrase candidate selection, in the case of TopicRank: sequences of nouns
# and adjectives (i.e. (Noun|Adj)*)
extractor.candidate_selection()

# candidate weighting, in the case of TopicRank: using a random walk algorithm
extractor.candidate_weighting()

# N-best selection, keyphrases contains the 10 highest scored candidates as
# (keyphrase, score) tuples
keyphrases = extractor.get_n_best(n=10)
```
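The mismatch described above boils down to langdetect returning ISO 639-1 codes (e.g. 'de') while NLTK's SnowballStemmer is keyed by full language names (e.g. 'german'). The translation step that pke's lang.py performs can be sketched as follows; the dict contents and the helper name here are illustrative, not pke's actual table or API:

```python
# Illustrative mapping from langdetect ISO 639-1 codes to the language
# names that NLTK's SnowballStemmer expects (hypothetical helper, not
# pke's actual lang.py table).
ISO_TO_STEMMER_NAME = {
    "de": "german",
    "en": "english",
    "fr": "french",
}

def stemmer_language(iso_code):
    """Translate a langdetect code into a SnowballStemmer language name."""
    try:
        return ISO_TO_STEMMER_NAME[iso_code]
    except KeyError:
        raise ValueError(f"no stemmer mapping for language code {iso_code!r}")
```

With such a mapping the detected code can drive both the spaCy model lookup (which wants 'de') and the stemmer (which wants 'german'), which is essentially what the lang.py change above achieves.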

ygorg commented 1 year ago

Hi, thanks for this issue. It is linked to #215, #216, and #219, and is now fixed in #225.