csurfer / rake-nltk

Python implementation of the Rapid Automatic Keyword Extraction algorithm using NLTK.
https://csurfer.github.io/rake-nltk
MIT License

Problems with spacing #51

Closed igormis closed 3 years ago

igormis commented 3 years ago

Hi, I am trying to extract key phrases from a sentence and it works quite well. However, when decomposing this sentence: "S&P stocks are falling, whereas Google is struggling", the model splits the sentence into 2 clauses, and in the first clause it adds a space before and after the &, producing "S & P", which causes problems in the following step of my algorithm (entity recognition). The code for initializing Rake is the following:

from rake_nltk import Rake

# Creating the stop word list
coord_conj = [', and', ', or', ', but', ', nor', ', as', ', for', ', so', ', however,', '; ']
subord_conj = ['after', 'although', 'as', 'as if', 'as long as', 'as though', 'because', 'before', 'even if', 'even though', 'if', 'if only', 'in order that', 'now that', 'once', 'rather than', 'since', 'so that', 'though', 'till', 'unless', 'until', 'when', 'whenever', 'where', 'whereas', 'wherever', 'while', 'following', 'and the']
stopwords = ['and the', 'amid', 'under', 'but', 'where', 'itself', 'himself', 'nor', 'whom', 'once', 'before', 'these', 'most', 'just', "that'll", "it's", 'other', 'or', 'theirs', 'them', 'those', 'how', 'any', 'against', 'again', 'yourself', 'as', 'some', 'until', 'during', 'yourselves', 'ours', 'at', 'while', 'him', 'same', 'few']
stopwords = stopwords + coord_conj + subord_conj
# Add a capitalized variant of every stop word
capital_stopwords = []
for sw in stopwords:
    capital_stopwords.append(sw.capitalize())
stopwords = stopwords + capital_stopwords

# (snippet from inside my extraction function; `text` is the input sentence)
r = Rake(stopwords=stopwords, punctuations='\\=_*^#@!~?><"‘', min_length=2, max_length=100)
r.extract_keywords_from_text(text)
return r.get_ranked_phrases()
csurfer commented 3 years ago
  1. Both stopwords and punctuations are of Optional[Set[str]] type.
  2. I think the issue here is wordpunct_tokenize, which gets used if a word tokenizer is not specified.
>>> import nltk
>>> nltk.tokenize.wordpunct_tokenize('S&P')
['S', '&', 'P']
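
For contrast, a pattern-based tokenizer that treats '&' as a word character keeps the ticker intact. This is only a sketch with an illustrative pattern built on NLTK's RegexpTokenizer, not something rake-nltk uses by default:

>>> from nltk.tokenize import RegexpTokenizer
>>> RegexpTokenizer(r"[\w&']+|[^\w\s]").tokenize('S&P stocks are falling')
['S&P', 'stocks', 'are', 'falling']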

You can either use one of the other tokenizers NLTK provides (TweetTokenizer?) or provide a tokenizer of your own, and you should get the results you require.
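
A minimal sketch of wiring such a tokenizer into Rake, assuming a rake-nltk release whose constructor accepts a word_tokenizer callable, and passing stopwords and punctuations as sets per point 1 above; check the constructor of your installed version before relying on these keyword arguments:

from nltk.tokenize import RegexpTokenizer
from rake_nltk import Rake

# Illustrative pattern that keeps '&' inside tokens such as 'S&P'.
keep_amp = RegexpTokenizer(r"[\w&']+|[^\w\s]")

# Shortened stop word set for the sketch; build the full set as in the snippet above.
stop_words = {'amid', 'under', 'but', 'where', 'whereas', 'while'}

r = Rake(
    stopwords=stop_words,                 # Set[str], per point 1
    punctuations=set('\\=_*^#@!~?><"‘'),  # Set[str] of individual punctuation characters
    min_length=2,
    max_length=100,
    word_tokenizer=keep_amp.tokenize,     # assumed keyword argument; verify it exists in your version
)
r.extract_keywords_from_text("S&P stocks are falling, whereas Google is struggling")
print(r.get_ranked_phrases())

With the default word tokenizer replaced, the returned phrases should keep the ampersand attached to its neighbours instead of the padded "S & P".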