TFIDF requires a corpus to compare

andrewtavis / kwx

BERT, LDA, and TFIDF based keyword extraction in Python

BSD 3-Clause "New" or "Revised" License


TFIDF requires a corpus to compare #27

Open AbhiPawar5 opened 3 years ago

AbhiPawar5 commented 3 years ago

Hi Andrew, I was trying the Keyword Extraction API with TF-IDF. The code is:

    bert_kws = extract_kws(
        method="TFIDF",  # "BERT", "LDA", "TFIDF", "frequency"
        bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens",
        text_corpus=corpus_no_ngrams,  # automatically tokenized if using LDA
        input_language=input_language,
        output_language=None,  # allows the output to be translated
        num_keywords=num_keywords,
        num_topics=num_topics,
        corpuses_to_compare=None,  # for TFIDF
        ignore_words=ignore_words,
        prompt_remove_words=True,  # check words with user
        show_progress_bar=True,
        batch_size=5,
    )

This returns the error: AssertionError: TFIDF requires another text corpus to be passed to the corpuses_to_compare argument.

I wonder why we require a corpus to compare for keyword extraction? Thanks!

andrewtavis commented 3 years ago

Hi Abhishek,

The need for a corpus to compare in TFIDF comes from the "IDF" part: Inverse Document Frequency. kwx treats everything you pass in via your dataframe or other input as a single "document", from which topics are derived for LDA and BERT, and term frequencies are found for TFIDF. Without something to compare against, TFIDF has no reference for deciding which words are more relevant to what has been passed. If your inputs are large, you could treat each one as its own document and compare across them.
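To make the role of the "IDF" part concrete, here is a minimal sketch in plain Python. It is not how kwx computes TFIDF internally; the helper names `idf` and `tfidf` and the toy sentences are assumptions made for illustration, and it uses the unsmoothed textbook formula idf = log(N / df).

```python
import math
from collections import Counter


def idf(term, documents):
    """Unsmoothed inverse document frequency: log(N / df)."""
    n_containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / n_containing)


def tfidf(documents, index):
    """Return {term: tf * idf} for the document at position `index`."""
    doc = documents[index]
    counts = Counter(doc)
    return {t: (c / len(doc)) * idf(t, documents) for t, c in counts.items()}


# Hypothetical toy data: each document is a list of tokens.
target = "the service was slow and the staff was rude".split()
other = "the weather was sunny and the beach was busy".split()

# With a single document, every term has df == N == 1, so idf == log(1) == 0
# and every score collapses to zero: there is nothing to rank by.
alone = tfidf([target], 0)

# With a second document to compare against, terms shared by both
# ("the", "was", "and") stay at zero, while terms unique to the target
# ("service", "slow", "staff", "rude") get a positive score.
compared = tfidf([target, other], 0)
```

Under this formula a lone document gives every term the same IDF, so the scores carry no information beyond raw frequency. That is the sense in which TFIDF needs a reference corpus.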

I'd be happy to chat a bit more on this if you wanted to send along a better description of what your inputs are :)

The wiki for kwx also has a "resources for models" page with some good links for TFIDF and the other models, if you're interested!

Thanks again for writing :)

andrewtavis commented 3 years ago

A further explanation on this: if you look at my package wikirec, there we're using TFIDF to find the terms that appear more frequently in any given Wikipedia article when compared to other articles. In that case we have different documents to compare, but for keyword extraction purposes the likely use is that we want to know what the keywords are for the whole corpus - i.e. all the individual parts should be combined.

The use case for this comes from the freelance work that originally produced this package. There, the question was how to find keywords from surveys: TFIDF can be used to derive which words are relevant to respondents of one survey by comparing the responses from that survey to those of other surveys. Also, as seen in examples/kw_extraction, we could segment the original corpus and use TFIDF to find keywords that are more relevant for one segment in comparison to the rest :)
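The survey comparison described above can be sketched as follows. This is an illustrative stand-in, not the kwx implementation: `top_keywords` is a hypothetical helper, the survey responses are made up, and each survey is simply joined into one "document" before scoring, which mirrors the idea behind corpuses_to_compare without reproducing its behaviour.

```python
import math
from collections import Counter


def top_keywords(target_texts, comparison_corpora, num_keywords=3):
    """Rank terms of the target corpus by TF-IDF against other corpora.

    Each corpus (a list of strings) is combined into a single document.
    """
    documents = [" ".join(target_texts).lower().split()]
    documents += [" ".join(texts).lower().split() for texts in comparison_corpora]

    # Document frequency: in how many combined documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    n_docs = len(documents)

    target = documents[0]
    counts = Counter(target)
    scores = {
        term: (count / len(target)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:num_keywords]


# Hypothetical survey responses.
survey_a = ["delivery was late", "late delivery again", "the delivery driver was friendly"]
survey_b = ["the app was confusing", "app crashed again"]
survey_c = ["the price was fair", "price is too high"]

keywords_a = top_keywords(survey_a, [survey_b, survey_c])
print(keywords_a)  # ['delivery', 'late', 'driver']
```

Words that appear in every survey ("the", "was") score zero, so the ranking is dominated by terms specific to the first survey. Segmenting a single corpus works the same way: pass one segment as the target and the remaining segments as the comparison corpora.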