Closed: brianray closed this issue 7 years ago
Yes, the input needs to be tokenized, but not in the format you linked. The expected format is simply one sentence per line, with every word and punctuation token separated by a space. The README gives an example using ucto.
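To illustrate that format, here is a minimal sketch; naive_tokenize is a hypothetical helper, not part of the readability package, and it is not a substitute for ucto (for instance, it splits abbreviations like "Mr." incorrectly):

```python
import re

def naive_tokenize(sentence):
    # Hypothetical stand-in for a real tokenizer: split words and
    # punctuation into space-separated tokens (one sentence per line
    # when joined with newlines). A real tokenizer such as ucto
    # handles abbreviations and other edge cases properly.
    return " ".join(re.findall(r"\w+|[^\w\s]", sentence))

print(naive_tokenize("Hello, world!"))  # Hello , world !
```

Each output line then matches the "one sentence per line, tokens separated by spaces" convention the package expects.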
Even if I use the example from the ucto documentation (from https://raw.githubusercontent.com/proycon/ucto/master/docs/ucto_manual.pdf), I still get an unusual (negative, in fact) score.
import readability
text = 'Mr. John Doe goes to the pet store . '
stats = readability.getmeasures(text)
print stats['readability grades']['Coleman-Liau'], "is not close to 14.5"
print "see http://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index"
print stats['readability grades']
-52.0259283846 is not close to 14.5
see http://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index
OrderedDict([(u'Kincaid', -13.046715176715177), (u'ARI', -16.368648648648648), (u'Coleman-Liau', -52.02592838461538), (u'FleschReadingEase', 189.85252598752598), (u'GunningFogIndex', 0.2810810810810811), (u'LIX', 0.7027027027027027), (u'SMOGIndex', 3.0), (u'RIX', 0.0)])
You're right, the numbers are not scaled correctly. I'll look into adjusting the formulas.
Actually, with a properly tokenized version of the abstract [1], I get a score of 14.28, which seems close enough. Applying the measures to a single sentence gives a different result because the average word/sentence lengths are not representative.
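To see how much the sentence-length term skews a single-sentence score, here is a sketch of the Wikipedia formulation of Coleman-Liau; the "typical prose" counts are hypothetical numbers chosen only to keep the average word length the same while making the sentences-per-100-words rate realistic:

```python
def coleman_liau(letters, words, sentences):
    # Wikipedia formulation: L and S are per 100 words
    L = letters / words * 100
    S = sentences / words * 100
    return 0.0588 * L - 0.296 * S - 15.8

# One short sentence: S = 12.5 sentences per 100 words, which
# drags the score far down.
single = coleman_liau(26, 8, 1)

# Hypothetical longer text with the same average word length but a
# typical 5 sentences per 100 words: the score jumps to ~13.3.
passage = coleman_liau(520, 100, 5)

print(single, passage)
```

The only change between the two calls is the sentence rate, which is exactly why a one-sentence sample is not representative.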
Also, I ran this from the command line. In your example the string is interpreted as an iterable of lines (so each character is treated as a line), because the code expects either that or a unicode string.
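To see why passing a plain (byte) string goes wrong, note that iterating a string yields single characters, not lines; this sketch is independent of the readability package:

```python
text = 'Mr. John Doe goes to the pet store .'

# Iterating a string yields one character at a time, so code that
# expects an iterable of lines ends up treating each character as
# its own "line", producing nonsense word/sentence counts.
print(list(text)[:4])  # ['M', 'r', '.', ' ']
```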
When I run this in Python 3:
import readability
text = 'Mr. John Doe goes to the pet store . '
stats = readability.getmeasures(text)
print(stats['readability grades']['Coleman-Liau'])
-0.3896982499999986
It seems to me that the code correctly implements the formula listed at http://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index
I do not think this is an issue with the code, but an inherent limitation of this readability measure. Otherwise please suggest a solution.
from nltk import word_tokenize
text = 'Mr. John Doe goes to the pet store . '
Letters = len(text.replace(".", "").replace(" ", ""))
Words = len(word_tokenize(text))
Sentences = 1
L = (Letters / Words) * 100
S = (Sentences / Words) * 100
CLI = (0.0588 * L) - (0.296 * S) - 15.8
CLI
-2.102222222222224
Ok, this matches now. The package uses the unrounded regression coefficients
(5.879851 * Letters / Words - 29.587280 * Sentences / Words - 15.800804)
rather than the rounded Wikipedia constants, and len(word_tokenize(text)) counts the final '.' as a word; that's why the result is different.
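For the record, plugging those coefficients in with the trailing '.' excluded from the word count reproduces the Python 3 output above; the letter and word counts below are hand-counted for this one sentence:

```python
# 'Mr. John Doe goes to the pet store .'
letters = 26    # alphabetic characters, hand-counted
words = 8       # trailing '.' NOT counted as a word
sentences = 1

# Unrounded Coleman-Liau coefficients as quoted from the package formula
cli = 5.879851 * letters / words - 29.587280 * sentences / words - 15.800804
print(cli)  # approximately -0.38969825, matching the Python 3 result above
```

Re-running it with words = 9 (word_tokenize's count, including the final '.') gives roughly -2.10, the value computed with the Wikipedia constants earlier in the thread.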
👍
At first look, the 'readability grades' do not match what I expected for this text.
-45.4349494945 is not close to 14.5
see http://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index
OrderedDict([(u'Kincaid', -12.08887191321185), (u'ARI', -16.303441981747067), (u'Coleman-Liau', -45.434949494522684), (u'FleschReadingEase', 183.21755623703106), (u'GunningFogIndex', 0.33324641460234683), (u'LIX', 0.833116036505867), (u'SMOGIndex', 3.0), (u'RIX', 0.0)])
Is this because the input is not tokenized? What does the tokenized input look like? I also tried this: https://github.com/proycon/ucto/blob/master/tests/ligaturen.nl.tok.V and it does not seem to help.