andreasvc / readability

Measure the readability of a given text using surface characteristics
71 stars 17 forks

'readability grades' seem to be incorrect #3

Closed: brianray closed this issue 7 years ago

brianray commented 9 years ago

At first look, the 'readability grades' are not what I expected, and they do not match the published values for this text.

import readability

text = '''Existing computer programs that measure readability are based
largely upon subroutines which estimate number of syllables, usually by
counting vowels. The shortcoming in estimating syllables is that it
necessitates keypunching the prose into the computer. There is no need to
estimate syllables since word length in letters is a better predictor of
readability than word length in syllables. Therefore, a new readability
formula was computed that has for its predictors letters per 100 words and
sentences per 100 words. Both predictors can be counted by an optical scanning
device, and thus the formula makes it economically feasible for an
organization such as the U.S. Office of Education to calibrate the readability
of all textbooks for the public school system.'''

stats = readability.getmeasures(text)
print stats['readability grades']['Coleman-Liau'], "is not close to 14.5"
print "see http://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index"
print stats['readability grades']

-45.4349494945 is not close to 14.5
see http://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index
OrderedDict([(u'Kincaid', -12.08887191321185), (u'ARI', -16.303441981747067), (u'Coleman-Liau', -45.434949494522684), (u'FleschReadingEase', 183.21755623703106), (u'GunningFogIndex', 0.33324641460234683), (u'LIX', 0.833116036505867), (u'SMOGIndex', 3.0), (u'RIX', 0.0)])

Is this because the input is not tokenized? What should the tokenized input look like? I also tried the format from https://github.com/proycon/ucto/blob/master/tests/ligaturen.nl.tok.V, and it does not seem to help.

andreasvc commented 9 years ago

Yes, the input needs to be tokenized, but not in the format you linked. The expected format is simply one sentence per line, with every word and punctuation mark separated by a space. The README gives an example with ucto.
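For reference, here is a minimal stdlib-only sketch of that input format (one sentence per line, tokens separated by spaces). This is a crude regex stand-in, not ucto: a real tokenizer also handles abbreviations such as "Mr.", which this sketch would wrongly split on.

```python
import re

def rough_tokenize(text):
    """Very rough stand-in for ucto: split into sentences, then pad
    punctuation with spaces so every token is space-separated."""
    # Split on sentence-final punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    lines = []
    for sent in sentences:
        # Separate punctuation marks from adjacent words.
        sent = re.sub(r'([.,;:!?()])', r' \1 ', sent)
        # Collapse runs of spaces into single spaces.
        lines.append(' '.join(sent.split()))
    return '\n'.join(lines)

print(rough_tokenize("There is no need to estimate syllables. "
                     "Word length is a better predictor."))
# There is no need to estimate syllables .
# Word length is a better predictor .
```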

brianray commented 9 years ago

Even if I use the example from the ucto documentation (from https://raw.githubusercontent.com/proycon/ucto/master/docs/ucto_manual.pdf), I still get an unusual (negative, in fact) score.

import readability

text = 'Mr. John Doe goes to the pet store . '

stats = readability.getmeasures(text)
print stats['readability grades']['Coleman-Liau'], "is not close to 14.5"
print "see http://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index"
print stats['readability grades']

-52.0259283846 is not close to 14.5
see http://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index
OrderedDict([(u'Kincaid', -13.046715176715177), (u'ARI', -16.368648648648648), (u'Coleman-Liau', -52.02592838461538), (u'FleschReadingEase', 189.85252598752598), (u'GunningFogIndex', 0.2810810810810811), (u'LIX', 0.7027027027027027), (u'SMOGIndex', 3.0), (u'RIX', 0.0)])

andreasvc commented 9 years ago

You're right, the numbers are not scaled correctly. I'll look into adjusting the formulas.

andreasvc commented 9 years ago

Actually, with a properly tokenized version of the abstract [1], I get a score of 14.28, which seems close enough. Applying the measures to a single sentence gives a different result because the average word/sentence lengths are not representative.

[1] https://gist.github.com/andreasvc/7f702fc35545bd4d378f

andreasvc commented 9 years ago

Also, I ran this from the command line. In your example, the string is interpreted as an iterable of lines, because the code expects either that or a unicode string.
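The distinction matters because iterating over a plain string yields single characters, not sentences. A small illustration (Python 3 syntax, independent of the readability library):

```python
text = 'Mr. John Doe goes to the pet store . '

# Iterating a plain str yields single characters, not lines:
chars = list(text)[:4]
print(chars)  # ['M', 'r', '.', ' ']

# An iterable of lines (one tokenized sentence per line) is the
# expected shape; wrapping the sentence in a list gives that:
lines = ['Mr. John Doe goes to the pet store .']
print(len(lines))  # 1 sentence
```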

brianray commented 7 years ago

When I run this in Python 3:

import readability

text = 'Mr. John Doe goes to the pet store . '
stats = readability.getmeasures(text)
print(stats['readability grades']['Coleman-Liau'])

-0.3896982499999986

andreasvc commented 7 years ago

It seems to me that the code correctly implements the formula listed at http://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index.

I do not think this is an issue with the code, but an inherent limitation of this readability measure. Otherwise please suggest a solution.

brianray commented 7 years ago

from nltk import word_tokenize

text = 'Mr. John Doe goes to the pet store . '
Letters = len(text.replace(".", "").replace(" ", ""))  # count letters only
Words = len(word_tokenize(text))  # 9 tokens, including the final '.'
Sentences = 1
L = (Letters / Words) * 100   # letters per 100 words
S = (Sentences / Words) * 100 # sentences per 100 words
CLI = (0.0588 * L) - (0.296 * S) - 15.8
CLI

-2.102222222222224

brianray commented 7 years ago

OK, this matches now:

(5.879851 * Letters / Words - 29.587280 * Sentences / Words - 15.800804)

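The two formulas are the same up to rounding: the Wikipedia version states its coefficients per 100 words (0.0588 × 100 = 5.88 ≈ 5.879851, and 0.296 × 100 = 29.6 ≈ 29.587280). A quick stdlib check, using the counts from the example in this thread (26 letters, 8 words, 1 sentence):

```python
def cli_per100(letters, words, sentences):
    # Wikipedia's rounded coefficients, stated per 100 words.
    L = letters / words * 100
    S = sentences / words * 100
    return 0.0588 * L - 0.296 * S - 15.8

def cli_perword(letters, words, sentences):
    # The unrounded per-word coefficients quoted above.
    return 5.879851 * letters / words - 29.587280 * sentences / words - 15.800804

print(cli_per100(26, 8, 1))   # ≈ -0.39
print(cli_perword(26, 8, 1))  # ≈ -0.3897
```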
andreasvc commented 7 years ago

len(word_tokenize(text)) counts the final '.' as a word; that's why the result is different.
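To confirm: dropping the punctuation token from the word count (8 words instead of 9) and applying the per-word formula reproduces the library's Python 3 result from earlier in this thread. A stdlib-only check:

```python
text = 'Mr. John Doe goes to the pet store . '

# Same letter count as before: strip periods and spaces, count the rest.
letters = len(text.replace(".", "").replace(" ", ""))  # 26

# word_tokenize gives 9 tokens; excluding the final '.' leaves 8 words.
words = 8
sentences = 1

cli = 5.879851 * letters / words - 29.587280 * sentences / words - 15.800804
print(cli)  # ≈ -0.3897, matching the -0.3896982499999986 reported above
```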

brianray commented 7 years ago

👍