algoprog / Quin

An easy to use framework for large-scale fact-checking and question answering
MIT License
68 stars 7 forks

Sparse Index not building properly #2

Closed abal6725 closed 3 years ago

abal6725 commented 4 years ago

Hello,

I have found a small error in your building of the sparse index: there is a mistake in your _calc_idf function for sparse indexes.

In this function you are using an eps value to replace the negative idf values. However, your code is slightly faulty.

1) Firstly, you are using r_freq to detect a negative idf value; why not just use the idf value itself as the check?

2) Your statement:

if r_freq > 0.5:
       continue

is not correct, because if r_freq > 0.5 (i.e. idf is negative) it skips the rest of the code under the if statement, including:

if idf < 0:
   negative_idfs.append(word)

So words whose idfs are negative never appear in the idf dictionary. You can confirm this by checking your idf dictionary for a word that exists in more than 50% of the corpus.
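To make the failure mode concrete, here is a minimal stand-alone reconstruction of the behaviour described above (the function name `calc_idf_buggy` and the toy corpus are illustrative, pieced together from the snippets, not the project's actual code):

```python
import math

def calc_idf_buggy(nd, corpus_size):
    """Stand-alone sketch of the current behaviour: nd maps each word to
    the number of documents containing it."""
    idf = {}
    negative_idfs = []
    for word, freq in nd.items():
        r_freq = float(freq) / corpus_size
        if r_freq > 0.5:
            continue  # skips the negative-idf bookkeeping below entirely
        word_idf = math.log(corpus_size - freq + 0.5) - math.log(freq + 0.5)
        if word_idf < 0:
            negative_idfs.append(word)  # never reached when r_freq > 0.5
        idf[word] = word_idf
    return idf

# "common" appears in 90 of 100 documents, so r_freq = 0.9 > 0.5
idf = calc_idf_buggy({"rare": 5, "common": 90}, corpus_size=100)
print("common" in idf)  # False: the word silently vanishes from the index
```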

I would suggest changing the _calc_idf function to:

    def _calc_idf(self, nd):
        """
        Calculates frequencies of terms in documents and in corpus.
        This algorithm sets a floor on the idf values to eps * average_idf
        """
        # collect idf sum to calculate an average idf for epsilon value
        idf_sum = 0
        # collect words with negative idf to set them a special epsilon value.
        # idf can be negative if word is contained in more than half of documents
        negative_idfs = []
        for word, freq in nd.items():
            idf = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)
            #r_freq = float(freq) / self.corpus_size
            if idf < 0:
                negative_idfs.append(word)
                continue
            self.idf[word] = idf
            idf_sum += idf      
        self.average_idf = idf_sum / len(self.idf)

        eps = self.epsilon * self.average_idf
        for word in negative_idfs:
            self.idf[word] = eps

This will ensure that the negative idf values are replaced with the value of epsilon and that they are present in the idf dictionary.
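To sanity-check the suggestion, here is a minimal harness around the proposed _calc_idf (the `SparseIndexSketch` class and its `epsilon=0.25` default are hypothetical scaffolding, not the framework's real class):

```python
import math

class SparseIndexSketch:
    """Minimal wrapper providing the attributes _calc_idf relies on."""
    def __init__(self, corpus_size, epsilon=0.25):
        self.corpus_size = corpus_size
        self.epsilon = epsilon
        self.idf = {}

    def _calc_idf(self, nd):
        # the suggested implementation from above, verbatim
        idf_sum = 0
        negative_idfs = []
        for word, freq in nd.items():
            idf = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)
            if idf < 0:
                negative_idfs.append(word)
                continue
            self.idf[word] = idf
            idf_sum += idf
        self.average_idf = idf_sum / len(self.idf)
        eps = self.epsilon * self.average_idf
        for word in negative_idfs:
            self.idf[word] = eps

index = SparseIndexSketch(corpus_size=100)
index._calc_idf({"rare": 5, "common": 90})
print(sorted(index.idf))        # both words are present now
print(index.idf["common"] > 0)  # True: floored at eps instead of negative
```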

algoprog commented 4 years ago

The reason I put the statement:

if r_freq > 0.5:
       continue

is that words with such a high frequency across all the documents are almost always stopwords or noise; they don't carry any importance for the document, especially for question answering and fact-checking over a very large corpus. There is also an efficiency problem if you build the inverted index including these very high frequency words, because you will have to calculate tf-idf scores for a much larger number of documents on average. You can change the r_freq threshold depending on your data, and I agree it's a good idea to refactor the code and pass the threshold as a parameter instead of a fixed value. I will make it a configurable parameter in the next version of the framework.
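A sketch of what that refactor could look like, with the cutoff exposed as a parameter (the names `calc_idf` and `freq_threshold` are illustrative, not the framework's final API; words above the cutoff are dropped from the index entirely, while words below it with a negative idf are floored at `epsilon * average_idf`):

```python
import math

def calc_idf(nd, corpus_size, freq_threshold=0.5, epsilon=0.25):
    """nd maps each word to its document frequency. Setting
    freq_threshold=0 disables the stopword cutoff for small collections."""
    idf, negative_idfs, idf_sum = {}, [], 0.0
    for word, freq in nd.items():
        if freq_threshold and freq / corpus_size > freq_threshold:
            continue  # treat as stopword/noise, keep it out of the index
        word_idf = math.log(corpus_size - freq + 0.5) - math.log(freq + 0.5)
        if word_idf < 0:
            negative_idfs.append(word)
            continue
        idf[word] = word_idf
        idf_sum += word_idf
    average_idf = idf_sum / len(idf) if idf else 0.0
    for word in negative_idfs:
        idf[word] = epsilon * average_idf
    return idf

# threshold 0.5 drops the stopword-like term; threshold 0 keeps everything
print("common" in calc_idf({"rare": 5, "common": 90}, 100))                    # False
print("common" in calc_idf({"rare": 5, "common": 90}, 100, freq_threshold=0))  # True
```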

abal6725 commented 4 years ago

I completely agree that stopwords shouldn't be included in the index, as they would bring efficiency down. However:

  1. Aren't you already removing the stopwords during preprocessing?

I have run some numbers for a corpus size of 10000: [image of results table]

As you can see, wherever r_freq > 0.5 the idf is always negative, so in fact you are never storing any negative idf values; they are never replaced by the epsilon value, and that part of your code is in effect never run.
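The equivalence can also be checked directly: with the smoothed formula, idf = log(N - f + 0.5) - log(f + 0.5) is negative exactly when N - f + 0.5 < f + 0.5, i.e. when f > N / 2, which is the same condition as r_freq > 0.5. A quick numeric check for the corpus size of 10000 mentioned above:

```python
import math

# idf = log(N - f + 0.5) - log(f + 0.5) < 0  <=>  f > N / 2  <=>  f / N > 0.5
N = 10000
checks = []
for f in (4999, 5000, 5001, 9000):
    idf = math.log(N - f + 0.5) - math.log(f + 0.5)
    checks.append((f, f / N > 0.5, idf < 0))
    print(f, f / N > 0.5, idf < 0)

# the two boolean columns always agree: r_freq > 0.5 exactly when idf < 0
```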

algoprog commented 4 years ago

Yes, for the given threshold of 0.5 the rest of the code never runs. That's why I'm planning to replace the hardcoded threshold with a parameter that can be set as an argument when you initialise the sparse retriever. For some small collections this threshold should perhaps be omitted / set to 0 for best results.