Open BALaka-18 opened 4 years ago
Is it okay if I work on this issue during Hacktoberfest?
Is it okay if I work on this issue during Hacktoberfest?
Yes sure @2bit-hack . Just make sure the PR is made after Oct 1. Assigning it to you.
I have a few questions regarding the implementation:
get_keywords(document, documents, stopwords)
?)diff
?I have a few questions regarding the implementation:
- Will there be a corpus of documents provided against which to run the TF-IDF algorithm? (Something like
get_keywords(document, documents, stopwords)
?)- Will these documents be text files or plaintext strings (or potentially both)?
- How do I measure performance / accuracy of the implementation versus the other algorithms? Do I perform a
diff
?
max_num == None
, then extract all keywords.
Else, extract max_num
keywords.
Default : None
The function will return the keywords along with their tf-idf weights.rake_new2
returns phrases mostly. So, in order to compare :
rake_new2
and tf-idf and calculate their recall / precision score (against the
keywords you manually extracted). READ THIS PAPER TO GET AN IDEA OF WHAT I MEAN TO SAY.tf-idf is a corpus based algorithm whereas RAKE can work on single documents. Like, if I have a collection of documents, I can figure out the tf-idf scores for words in one document against all the documents in the corpus. But if there's just one document, will tf-idf still work?
tf-idf is a corpus based algorithm whereas RAKE can work on single documents. Like, if I have a collection of documents, I can figure out the tf-idf scores for words in one document against all the documents in the corpus. But if there's just one document, will tf-idf still work?
See, say you take a corpus of N documents. Like, N Stackoverflow posts for example. Your IDF will depend on all N posts. Do the calculation. Then choose one post, say post_3. Run TF-IDF and RAKE for that chosen document and compare.
Oh, okay. So do I change the function signature to be something like def keywords(docs, selected_doc, stopwords)
?
Oh yes..thanks for pointing that out. Yes you change the function to include the corpus.
Make sure you include the max_num
parameter.
Alright, thank you so much!
The code for keyword extraction using TF-IDF goes into tfidf_vectorizer/extract_keywords_tfidf_scratch.py
. Where do the tests of RAKE vs TF-IDF go? I noticed there's a tests
folder, does it go there?
The code for keyword extraction using TF-IDF goes into
tfidf_vectorizer/extract_keywords_tfidf_scratch.py
. Where do the tests of RAKE vs TF-IDF go? I noticed there's atests
folder, does it go there?
yes, create a separate file in the tests folder.
So far, I've implemented tf-idf and written a main.py file for testing its usage. I've also created a file in tests
which runs both RAKE and tf-idf against a document (RAKE is being given just one document, tf-idf is also fed a small corpus), prints the results from extracting keywords, and prints the runtime of both algorithms. I'm a bit confused by the precision scoring thing though. Should I manually assign keywords/phrases/both? If, for instance, RAKE extracts 'object oriented' and tf-idf extracts 'object' (RAKE generally extracts phrases, tf-idf extracts words only), do I score them equally? I'm not quite sure how exactly the F-scores are being calculated.
@2bit-hack Basically you have to toil a bit. What they have done in the paper is that, they have marked out keywords manually. Then they ran the algorithms, and checked which algorithm gave results closer to the ones they had manually derived, and calculated the precision score (TP/TP+FP).
Like, say according to you, 'object oriented' is a keyword(it makes more sense), so you have noted it down. Now according to RAKE, it gave 'object oriented' but TF-IDF gave 'object'. So for RAKE, it is a True Positive, but for TF-IDF, it's a False Positive. Iterate for all keywords, and calculate the scores for the two algorithms.
@BALaka-18 I've sent a PR, could you please take a look?
Description
TF-IDF is one of the most famous algorithms when it comes to keyword extraction from text. Your task is to create a function that will extract keywords from text using the TF-IDF algorithm and compare the results against this library. How similar / different are the results ?
NOTE : You have to build the Tf-idf algorithm for keyword extraction from scratch. You will then compare its performance against sklearn's TfidfVectorizer and rake_new2.
For reference :
For your reference, you may read this link
Folder Structure, Function details
Create a folder
tfidf_vectorizer
in the root directory. The folder must contain a.py
file that will contain the function for extracting the keywords from text using the Tfidf algorithm written from scratch.Structure :
tfidf_vectorizer/extract_keywords_tfidf_scratch.py
Acceptance Criteria
requirements.txt file
is updated if you are including any new library.Definition of Done
Time Estimation
2.5-3 hours (or more if needed)