IndigoResearch / textteaser

Official version of TextTeaser.
MIT License
621 stars 143 forks source link

Parser.getSentencePositionScore() off-by-one #10

Open schmamps opened 6 years ago

schmamps commented 6 years ago

In parser.py:

    # Jagadeesh, J., Pingali, P., & Varma, V. (2005). Sentence Extraction Based Single Document Summarization. International Institute of Information Technology, Hyderabad, India, 5.
    def getSentencePositionScore(self, i, sentenceCount):
        normalized = i / (sentenceCount * 1.0)

        if normalized > 0 and normalized <= 0.1:
            return 0.17
        # ... snip ....
        elif normalized > 0.8 and normalized <= 0.9:
            return 0.04
        elif normalized > 0.9 and normalized <= 1.0:
            return 0.15
        else:
            return 0

When Summarizer scores sentences (Summarizer.computeScore()) it iterates through the list and passes the index as its argument. Notice what happens when you unroll that loop, though:


# where: ['sentence 1', 'sentence 2', ..., 'sentence 10']
sentences = ['sentence ' + str(x + 1) for x in range(10)]

# first sentence, returns 0
sentencePosition = parser.getSentencePositionScore(0, len(sentences))
# ...
# last sentence, returns 0.04
sentencePosition = parser.getSentencePositionScore(9, len(sentences))

Instinct tells me that a (next-to-)zero score for the introductory and concluding sentences is wrong. Adding to my suspicions, sum of all cases in the referenced paper are 0 < normalized <= 1.0.

To me this looks like an off-by-one bug. Nine of the ten sentences above are caught one case too early and the first falls to default. That shouldn't even happen with valid arguments!

If I'm opinionated about this as an issue, respectable people can differ about where to fix it. One could make a case for Summarizer.computeScore() but Parser.getSentencePositionScore() is already parameterized for index (zero-based, presumably) so fixing it there has the smallest impact.