BALaka-18 / rake_new2

A Python library that enables smooth keyword extraction from any text using the RAKE(Rapid Automatic Keyword Extraction) algorithm.
MIT License
29 stars 20 forks source link

Comparison against TF-IDF Vectorizer (from scratch) #8

Open BALaka-18 opened 4 years ago

BALaka-18 commented 4 years ago

Description

TF-IDF is one of the most famous algorithms when it comes to keyword extraction from text. Your task is to create a function that will extract keywords from text using the TF-IDF algorithm and compare the results against this library. How similar / different are the results ?

NOTE : You have to build the Tf-idf algorithm for keyword extraction from scratch. You will then compare its performance against sklearn's TfidfVectorizer and rake_new2.

For reference :

For your reference, you may read this link

Folder Structure, Function details

Create a folder tfidf_vectorizer in the root directory. The folder must contain a .py file that will contain the function for extracting the keywords from text using the Tfidf algorithm written from scratch.

Structure : tfidf_vectorizer/extract_keywords_tfidf_scratch.py

Acceptance Criteria

Definition of Done

Time Estimation

2.5-3 hours (or more if needed)

2bit-hack commented 4 years ago

Is it okay if I work on this issue during Hacktoberfest?

BALaka-18 commented 4 years ago

Is it okay if I work on this issue during Hacktoberfest?

Yes sure @2bit-hack . Just make sure the PR is made after Oct 1. Assigning it to you.

2bit-hack commented 4 years ago

I have a few questions regarding the implementation:

BALaka-18 commented 4 years ago

I have a few questions regarding the implementation:

  • Will there be a corpus of documents provided against which to run the TF-IDF algorithm? (Something like get_keywords(document, documents, stopwords)?)
  • Will these documents be text files or plaintext strings (or potentially both)?
  • How do I measure performance / accuracy of the implementation versus the other algorithms? Do I perform a diff?
  1. You have to run the tf-idf algorithm on only one document. That document can be a large corpus, or an article, or just 3 lines of text. The function will take in the document, the stopwords and the max_num of keywords the user wants to extract. If max_num == None, then extract all keywords. Else, extract max_num keywords. Default : None The function will return the keywords along with their tf-idf weights.
  2. Answered in the above point.
  3. The major difference that you will find is the number of keywords extracted, as tf-idf will go over all words but rake_new2 returns phrases mostly. So, in order to compare :
    • Choose a small text to work one. Using your knowledge, mark out what according to you are key words/phrases in the text.
    • Extract keywords using both rake_new2 and tf-idf and calculate their recall / precision score (against the keywords you manually extracted). READ THIS PAPER TO GET AN IDEA OF WHAT I MEAN TO SAY.
    • Time both the algorithms (using Python's time library) to see which algorithm extracts keywords faster.
2bit-hack commented 4 years ago

tf-idf is a corpus based algorithm whereas RAKE can work on single documents. Like, if I have a collection of documents, I can figure out the tf-idf scores for words in one document against all the documents in the corpus. But if there's just one document, will tf-idf still work?

BALaka-18 commented 4 years ago

tf-idf is a corpus based algorithm whereas RAKE can work on single documents. Like, if I have a collection of documents, I can figure out the tf-idf scores for words in one document against all the documents in the corpus. But if there's just one document, will tf-idf still work?

See, say you take a corpus of N documents. Like, N Stackoverflow posts for example. Your IDF will depend on all N posts. Do the calculation. Then choose one post, say post_3. Run TF-IDF and RAKE for that chosen document and compare.

2bit-hack commented 4 years ago

Oh, okay. So do I change the function signature to be something like def keywords(docs, selected_doc, stopwords)?

BALaka-18 commented 4 years ago

Oh yes..thanks for pointing that out. Yes you change the function to include the corpus. Make sure you include the max_num parameter.

2bit-hack commented 4 years ago

Alright, thank you so much!

2bit-hack commented 4 years ago

The code for keyword extraction using TF-IDF goes into tfidf_vectorizer/extract_keywords_tfidf_scratch.py. Where do the tests of RAKE vs TF-IDF go? I noticed there's a tests folder, does it go there?

BALaka-18 commented 4 years ago

The code for keyword extraction using TF-IDF goes into tfidf_vectorizer/extract_keywords_tfidf_scratch.py. Where do the tests of RAKE vs TF-IDF go? I noticed there's a tests folder, does it go there?

yes, create a separate file in the tests folder.

2bit-hack commented 4 years ago

So far, I've implemented tf-idf and written a main.py file for testing its usage. I've also created a file in tests which runs both RAKE and tf-idf against a document (RAKE is being given just one document, tf-idf is also fed a small corpus), prints the results from extracting keywords, and prints the runtime of both algorithms. I'm a bit confused by the precision scoring thing though. Should I manually assign keywords/phrases/both? If, for instance, RAKE extracts 'object oriented' and tf-idf extracts 'object' (RAKE generally extracts phrases, tf-idf extracts words only), do I score them equally? I'm not quite sure how exactly the F-scores are being calculated.

BALaka-18 commented 4 years ago

@2bit-hack Basically you have to toil a bit. What they have done in the paper is that, they have marked out keywords manually. Then they ran the algorithms, and checked which algorithm gave results closer to the ones they had manually derived, and calculated the precision score (TP/TP+FP).

Like, say according to you, 'object oriented' is a keyword(it makes more sense), so you have noted it down. Now according to RAKE, it gave 'object oriented' but TF-IDF gave 'object'. So for RAKE, it is a True Positive, but for TF-IDF, it's a False Positive. Iterate for all keywords, and calculate the scores for the two algorithms.

2bit-hack commented 4 years ago

@BALaka-18 I've sent a PR, could you please take a look?