abal6725 closed this issue 3 years ago
The reason I put the statement:
    if r_freq > 0.5:
        continue
is that words with such a high frequency across all the documents are almost always stopwords or noise, and they carry little importance for a document, especially for question answering and fact-checking over a very large corpus. If you build the inverted index with these very high-frequency words included, there is also an efficiency problem, because on average you have to calculate TF-IDF scores for a much larger number of documents. You can change the r_freq threshold depending on your data, and I think it's a good idea to refactor the code and pass it as a parameter instead of hardcoding it. I will make it a configurable parameter in the next version of the framework.
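As a rough illustration of passing the threshold as a parameter, here is a minimal sketch. It is not the framework's code: the function name document_frequencies and the parameter max_r_freq are hypothetical, and the corpus is assumed to be a list of already tokenised documents.

```python
from collections import Counter
from typing import Dict, Sequence


def document_frequencies(
    corpus: Sequence[Sequence[str]],
    max_r_freq: float = 0.5,  # hypothetical parameter replacing the fixed 0.5
) -> Dict[str, int]:
    """Count in how many documents each word occurs, dropping very common words.

    A word is dropped when its relative document frequency (r_freq) is
    strictly greater than max_r_freq, mirroring `if r_freq > 0.5: continue`.
    """
    if not 0.0 < max_r_freq <= 1.0:
        raise ValueError("max_r_freq must be in the interval (0, 1]")

    corpus_size = len(corpus)
    doc_freq: Counter = Counter()
    for doc in corpus:
        # set() so that a word is counted once per document
        doc_freq.update(set(doc))

    return {
        word: n
        for word, n in doc_freq.items()
        if n / corpus_size <= max_r_freq
    }


if __name__ == "__main__":
    docs = [["the", "cat"], ["the", "dog"], ["the", "bird"], ["a", "cat"]]
    print(document_frequencies(docs))                  # default threshold
    print(document_frequencies(docs, max_r_freq=1.0))  # filter disabled
```

In this sketch a value of 1.0 keeps every word, since r_freq can never exceed 1.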
I completely agree that stopwords shouldn't be included in the index, as they bring the efficiency down. However:
I have run some numbers for a corpus size of 10000.
As you can see, wherever r_freq > 0.5 the idf is always negative, so in fact you are never storing any negative idf values and they are never replaced by the average epsilon value. That part of your code is in effect never run.
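The relationship can be checked with a few lines of Python. This is only a sketch: it assumes the usual BM25-style formula idf = log(N - n + 0.5) - log(n + 0.5), where N is the corpus size and n the number of documents containing the word, which may differ from what the framework actually computes. The helper name bm25_idf is hypothetical.

```python
import math


def bm25_idf(doc_freq: int, corpus_size: int) -> float:
    """BM25-style idf (assumed formula, see the text above)."""
    return math.log(corpus_size - doc_freq + 0.5) - math.log(doc_freq + 0.5)


CORPUS_SIZE = 10_000

# Print idf next to r_freq for a few document frequencies around the threshold.
for doc_freq in (100, 2500, 5000, 5001, 7500, 9999):
    r_freq = doc_freq / CORPUS_SIZE
    idf = bm25_idf(doc_freq, CORPUS_SIZE)
    print(f"doc_freq={doc_freq:>5}  r_freq={r_freq:.4f}  idf={idf:+.4f}")
```

Under this formula the idf is negative exactly when n > N / 2, that is, when r_freq > 0.5, so a check on r_freq > 0.5 and a check on idf < 0 select the same words.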
Yes, for the given threshold of 0.5 the rest of the code never runs. That's why I'm planning to replace the hardcoded threshold value with a parameter that can be set as an argument when you initialise the sparse retriever. For some small collections this threshold should perhaps be omitted / set to 0 for best results.
Hello,
I have found a small error in how you build the sparse index: there is a mistake in your _calc_idf function for sparse indexes.
In this function you use an eps value to replace the negative idf values. However, your code is a little bit faulty.
1) Firstly, you are using r_freq to detect a negative idf value. Why not just use the idf value itself as the check?
2) Your statement (the if r_freq > 0.5: continue check) is not correct, because if r_freq > 0.5 (or the idf is negative) it skips the rest of the code after the if statement, including:
So words whose idf values are negative never appear in the idf dictionary. You can confirm this by checking your idf dictionary for a word which occurs in more than 50% of the documents in the corpus.
I would suggest changing the _calc_idf function to:
This will ensure that negative idf values are replaced with the value of epsilon and that the corresponding words are present in the idf dictionary.
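A minimal sketch of a change along these lines, written as a standalone function rather than the framework's method. It assumes the BM25-style formula idf = log(N - n + 0.5) - log(n + 0.5) and assumes that eps is epsilon times the mean idf over all words. The names calc_idf, doc_freqs and epsilon, and the default of 0.25, are hypothetical.

```python
import math
from typing import Dict, List


def calc_idf(
    doc_freqs: Dict[str, int],  # word -> number of documents containing it
    corpus_size: int,
    epsilon: float = 0.25,      # assumed default, not taken from the framework
) -> Dict[str, float]:
    """Compute idf for every word; negative values are replaced by eps."""
    idf: Dict[str, float] = {}
    negative_words: List[str] = []
    idf_sum = 0.0

    for word, freq in doc_freqs.items():
        value = math.log(corpus_size - freq + 0.5) - math.log(freq + 0.5)
        idf[word] = value          # every word is stored, none is skipped
        idf_sum += value
        if value < 0:              # check the idf itself, not r_freq
            negative_words.append(word)

    if not idf:
        return idf

    # eps is derived from the mean idf over all words (assumption).
    eps = epsilon * idf_sum / len(idf)
    for word in negative_words:
        idf[word] = eps
    return idf


if __name__ == "__main__":
    freqs = {"the": 9000, "retriever": 120, "bm25": 15}
    print(calc_idf(freqs, corpus_size=10_000))
```

One caveat with this approach: if most words in the vocabulary are very frequent, the mean idf can itself be negative, and eps then stays negative as well.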