LIAAD / yake

Single-document unsupervised keyword extraction
https://liaad.github.io/yake

Stopwords being ignored #70

Open chaturv3di opened 2 years ago

chaturv3di commented 2 years ago

I am passing the set of English stopwords which I create from yake/StopwordsList/stopwords_en.txt.

import os
import yake

text = "YAKE! is a light-weight unsupervised automatic keyword extraction method which rests on text statistical features extracted from single documents to select the most important keywords of a text. Our system does not need to be trained on a particular set of documents, neither it depends on dictionaries, external-corpus, size of the text, language or domain. To demonstrate the merits and the significance of our proposal, we compare it against ten state-of-the-art unsupervised approaches (TF.IDF, KP-Miner, RAKE, TextRank, SingleRank, ExpandRank, TopicRank, TopicalPageRank, PositionRank and MultipartiteRank), and one supervised method (KEA). Experimental results carried out on top of twenty datasets (see Benchmark section below) show that our methods significantly outperform state-of-the-art methods under a number of collections of different sizes, languages or domains. In addition to the python package here described, we also make available a demo, an API and a mobile app."

language = "en"
max_ngram_size = 5
deduplication_threshold = 0.9
deduplication_algo = 'seqm'
windowSize = 1
numOfKeywords = 5

# Location of the file downloaded from https://github.com/LIAAD/yake/blob/master/yake/StopwordsList/stopwords_en.txt
home_dir = os.path.expanduser("~")  # adjust to wherever the file was saved
stopwords_file = os.path.join(home_dir, "data_txt", "yake_stopwords_en.txt")
with open(stopwords_file, 'r') as sw_f:
    yake_stopwords = set(sw_f.read().lower().split("\n"))

yake_kw_extractor = yake.KeywordExtractor(lan=language, 
                                          n=max_ngram_size, 
                                          dedupLim=deduplication_threshold, 
                                          dedupFunc=deduplication_algo, 
                                          windowsSize=windowSize, 
                                          top=numOfKeywords, 
                                          features=None, 
                                          stopwords=yake_stopwords)

yake_kw_extractor.extract_keywords(text)

And the results end up containing stopwords like of, a, from, etc.

[('trained on a particular set', -60.326928913747196),
 ('keywords of a text', -0.665864990295941),
 ('important keywords of a text', -0.31206738772455755),
 ('light-weight unsupervised automatic keyword extraction', 0.00029233948201177757),
 ('statistical features extracted from single', 0.0008477866813335354)]

If I construct the extractor with stopwords=None instead, the results don't change. Am I doing something silly here?

Thanks a lot.

secsilm commented 1 year ago

I guess the stopword removal is done in the last step, i.e.:

  1. split words
  2. extract candidates
  3. score, dedup and remove stopwords.

JeremyBrent commented 7 months ago

@chaturv3di I am running into the same issue, have you found a solution?

chaturv3di commented 7 months ago

Unfortunately not.

JeremyBrent commented 7 months ago

Not sure if secsilm was referring to this, but I am thinking about applying my stopwords as a post-processing step outside of the YAKE class.
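
Something like this is what I have in mind (rough sketch only; drop_phrases_with_stopwords is a made-up helper name, and it just filters the tuples that extract_keywords returns):

def drop_phrases_with_stopwords(keywords, stopwords):
    # Keep only (phrase, score) pairs in which no token is a stopword.
    kept = []
    for phrase, score in keywords:
        tokens = phrase.lower().split()
        if not any(tok in stopwords for tok in tokens):
            kept.append((phrase, score))
    return kept

keywords = yake_kw_extractor.extract_keywords(text)
clean_keywords = drop_phrases_with_stopwords(keywords, yake_stopwords)

The obvious downside is that a phrase like "keywords of a text" gets dropped entirely rather than trimmed.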

chaturv3di commented 7 months ago

That's not elegant, but it works. E.g. if I wanted phrases of up to 4 words without stopwords and removed the stopwords in post-processing, I'd need to fetch phrases of up to 6 words and hope that at most 2 words in each are stopwords. That is clunky and increases the compute time.

OTOH, there doesn't seem to be another option right now.
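
If I do end up going that route, it would look roughly like this (a sketch only; target_ngram, fetch_ngram and the over-fetch factor are arbitrary choices of mine, not anything YAKE provides):

# Ask for longer phrases and more candidates than needed, then trim stopwords afterwards.
target_ngram = 4   # the phrase length I actually want
fetch_ngram = 6    # over-fetch, expecting up to 2 stopwords per phrase

extractor = yake.KeywordExtractor(lan=language,
                                  n=fetch_ngram,
                                  dedupLim=deduplication_threshold,
                                  dedupFunc=deduplication_algo,
                                  windowsSize=windowSize,
                                  top=numOfKeywords * 3,  # over-fetch candidates as well
                                  stopwords=yake_stopwords)

trimmed = []
for phrase, score in extractor.extract_keywords(text):
    content_words = [w for w in phrase.split() if w.lower() not in yake_stopwords]
    if 0 < len(content_words) <= target_ngram:
        trimmed.append((" ".join(content_words), score))
    if len(trimmed) == numOfKeywords:
        break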