DaveChild / Text-Statistics

Generate information about text including syllable counts and Flesch-Kincaid, Gunning-Fog, Coleman-Liau, SMOG and Automated Readability scores.
https://readable.com/
BSD 2-Clause "Simplified" License

flesch kincaid statistics are both in error #17

Closed: promethean closed this issue 10 years ago

promethean commented 10 years ago

Both the flesch_kincaid_reading_ease() and flesch_kincaid_grade_level() methods are maxing out. The first at 100 and the latter at 19.

Every text block we try has the same issue, and the stats don't tally with those found on readability-score.com.

Just FYI - maybe a recent commit has caused a bug to creep in?

DaveChild commented 10 years ago

A recent commit added normalisation of the various scoring systems. They all have maxima and minima and the normalisation ensures that scores are kept within those ranges. The version of the code on Readability-Score.com has this normalisation disabled.
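
The normalisation described above amounts to clamping each raw score into its scale's range. The following is a minimal Python sketch of that idea, not the library's actual PHP code: the helper name `normalise_score` is hypothetical, and the 0 to 12 range for the grade level is an assumption inferred from the clamped values reported later in this thread.

```python
def normalise_score(score, minimum, maximum, decimals=1):
    """Clamp a raw readability score into [minimum, maximum], then round it.

    Hypothetical helper for illustration; the ranges passed in below are
    assumptions, not values confirmed by the library's source.
    """
    clamped = max(minimum, min(maximum, score))
    return round(clamped, decimals)


# Raw expected values quoted in the unit-test summary in this thread.
print(normalise_score(121.2, 0.0, 100.0))  # reading ease above the scale -> 100.0
print(normalise_score(-3.4, 0.0, 12.0))    # grade level below the scale -> 0.0
print(normalise_score(53.4, 0.0, 100.0))   # already in range -> 53.4
```

Under this assumption, any text whose raw score falls outside the scale reports the boundary value, which would look like a score "maxing out".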

jrfnl commented 10 years ago

@DaveChild Just thought I'd check the unit tests to be sure, but running them against both the code from before my changes and the current code base fails a lot of the tests. So either the tests need adjusting or something else is going on.

jrfnl commented 10 years ago

I've just run the unit tests at four different points in time and have saved the results for your perusal in https://github.com/jrfnl/Text-Statistics/tree/unit-test-results

Results

| Tests run on the code base of | # of tests | Passed | Failed |
|---|---|---|---|
| 2010-12-02 | 36 | 36 | 0 |
| 2011-12-12 | 37 | 30 | 7 |
| 2014-01-14 | 37 | 22 | 15 |
| 2014-02-11 | 37 | 19 | 18 |

By the looks of it, some unit tests might need small adjustments to their assertions, but others indicate that things are going wrong in the code. Let me know if I can be of assistance in fixing this.

jrfnl commented 10 years ago

For completeness, here's a summary of the actual test results:

File: TextStatisticsKiplingIf

| Test | Expects | 2010-12-02 | 2011-12-12 | 2014-01-14 | 2014-02-11 | Stopped at first failing test |
|---|---|---|---|---|---|---|
| KiplingSyllables | 1 | Passed | 2 | 2 | 2 | Line 312: 'Except' |
| WordCount | 292 | Passed | Passed | Passed | Passed | |
| SentenceCount | 1 | Passed | Passed | Passed | Passed | |
| TextLengthCheck | 1125 | Passed | Passed | Passed | Passed | |
| FleschKincaidReadingEase | -187.5 | Passed | -187.2 | 0 | 8.1 | |
| FleschKincaidGradeLevel | 111.9 | Passed | Passed | 12 | 12 | |
| GunningFogScore | 117.5 | Passed | Passed | 19 | 19 | |
| ColemanLiauIndex | 6.9 | Passed | Passed | Passed | 12 | |
| SMOGIndex | 14.1 | Passed | Passed | 12 | 12 | |
| AutomatedReadabilityIndex | 142.7 | Passed | Passed | 12 | 12 | |

File: TextStatisticsMelvilleMobyDick

| Test | Expects | 2010-12-02 | 2011-12-12 | 2014-01-14 | 2014-02-11 | Stopped at first failing test |
|---|---|---|---|---|---|---|
| KiplingSyllables | 2 | Passed | 1 | 1 | 1 | Line 68: 'Ishmael' |
| WordCount | 201 | Passed | Passed | Passed | Passed | |
| LongWordCount | 23/22 | Passed | Passed | Passed | Passed | |
| SentenceCount | 8 | Passed | Passed | Passed | Passed | |
| TextLengthCheck | 884 | Passed | Passed | Passed | Passed | |
| FleschKincaidReadingEase | 53.4 | Passed | 53.8 | 53.8 | 100 | |
| FleschKincaidGradeLevel | 12.1 | Passed | 12 | 12 | 12 | |
| GunningFogScore | 14.4 | Passed | Passed | Passed | 19 | |
| ColemanLiauIndex | 10.1 | Passed | Passed | Passed | 12 | |
| SMOGIndex | 9.9 | Passed | Passed | Passed | Passed | |
| AutomatedReadabilityIndex | 11.8 | Passed | Passed | Passed | Passed | |

File: TextStatisticsTest

| Test | Expects | 2010-12-02 | 2011-12-12 | 2014-01-14 | 2014-02-11 | Stopped at first failing test |
|---|---|---|---|---|---|---|
| SyllableCountBasicWords | - | Passed | Passed | Passed | Passed | |
| SyllableCountComplexWords | 3 | Passed | 1 | 1 | 1 | Line 132: 'CAPITALS' |
| SyllableCountProgrammedExceptions | - | Passed | Passed | Passed | Passed | |
| AverageSyllablesPerWord | - | Passed | Passed | Passed | Passed | |
| WordCount | - | Passed | Passed | Passed | Passed | |
| CheckPercentageWordsWithThreeSyllables | - | Passed | Passed | Passed | Passed | |
| TextLengthCheck | - | Passed | Passed | Passed | Passed | |
| SentenceCount | - | Passed | Passed | Passed | Passed | |
| AverageWordsPerSentence | - | Passed | Passed | Passed | Passed | |
| FleschKincaidReadingEase | 121.2 | Passed | Passed | 100 | 100 | Line 223: 'This. Is. A. Nice. Set. Of. Small. Words. Of. One. Part. Each.' |
| FleschKincaidGradeLevel | -3.4 | Passed | Passed | 0 | 0 | Line 232: 'This. Is. A. Nice. Set. Of. Small. Words. Of. One. Part. Each.' |
| GunningFogScore | 0.4 | Passed | Passed | Passed | 1 | Line 241: 'This. Is. A. Nice. Set. Of. Small. Words. Of. One. Part. Each.' |
| ColemanLiauIndex | 13.6 / 3 | Passed | Passed | 12 | 12 | Line 256: 'Now it is time for a more complicated sentence, including several longer words.' / Line 251: 'This. Is. A. Nice. Set. Of. Small. Words. Of. One. Part. Each.' |
| SMOGIndex | | Passed | Passed | Passed | Passed | |
| AutomatedReadabilityIndex | -5.6 | Passed | Passed | 0 | 0 | Line 269: 'This. Is. A. Nice. Set. Of. Small. Words. Of. One. Part. Each.' |
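
As a cross-check on the "Expects" column, the raw (un-normalised) values for the one-syllable test sentence can be reproduced from the standard published formulas. This is an illustrative Python sketch rather than the library's PHP implementation. The function names are hypothetical, and the syllable and complex-word counts are assumptions taken from the sentence itself (every word has one syllable) instead of being computed by a syllable counter.

```python
text = "This. Is. A. Nice. Set. Of. Small. Words. Of. One. Part. Each."

words = len(text.split())                   # 12 whitespace-separated words
sentences = text.count(".")                 # 12 one-word sentences
letters = sum(ch.isalpha() for ch in text)  # 39 alphabetic characters
syllables = words      # assumption: every word has exactly one syllable
complex_words = 0      # assumption: no word has three or more syllables


def flesch_reading_ease(words, sentences, syllables):
    # Standard Flesch Reading Ease formula.
    return round(206.835 - 1.015 * (words / sentences)
                 - 84.6 * (syllables / words), 1)


def flesch_kincaid_grade(words, sentences, syllables):
    # Standard Flesch-Kincaid Grade Level formula.
    return round(0.39 * (words / sentences)
                 + 11.8 * (syllables / words) - 15.59, 1)


def gunning_fog(words, sentences, complex_words):
    # Standard Gunning Fog formula.
    return round(0.4 * (words / sentences
                        + 100 * complex_words / words), 1)


def automated_readability(words, sentences, letters):
    # Standard Automated Readability Index formula.
    return round(4.71 * (letters / words)
                 + 0.5 * (words / sentences) - 21.43, 1)


print(flesch_reading_ease(words, sentences, syllables))   # 121.2
print(flesch_kincaid_grade(words, sentences, syllables))  # -3.4
print(gunning_fog(words, sentences, complex_words))       # 0.4
print(automated_readability(words, sentences, letters))   # -5.6
```

These four values agree with the expected values in the table, which suggests that the 100 and 0 results in the 2014 columns come from clamping rather than from a change in the underlying formulas.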

File: TextStatisticsTestCMULex

| Test | Expects | 2010-12-02 | 2011-12-12 | 2014-01-14 | 2014-02-11 | Stopped at first failing test |
|---|---|---|---|---|---|---|
| SyllableCountFailingCMUWords | 3 | N/A | 2 | 2 | 2 | Line 31 - "aaa" |

DaveChild commented 10 years ago

Right you are. Thanks, I'd assumed it was just down to the rounding. Some of the failures are down to rounding, but others are genuine errors in calculation.

DaveChild commented 10 years ago

Ok, I think this is now fixed. The failure count will be above zero, as one of the test files is essentially a large list of current known errors. The other files all report correct results now for the unit tests.

I've also added a flag to disable score normalization, for people who don't want their scores normalized.
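
One way such a switch could look, sketched in Python purely for illustration: the class name `ReadabilityScorer` and the `normalise` attribute are hypothetical and are not taken from the library's source.

```python
class ReadabilityScorer:
    """Illustrative scorer with an opt-out for score normalisation."""

    def __init__(self, normalise=True):
        # Hypothetical flag: when False, raw scores are returned unclamped.
        self.normalise = normalise

    def flesch_kincaid_reading_ease(self, words, sentences, syllables):
        # Standard Flesch Reading Ease formula on pre-computed counts.
        score = (206.835 - 1.015 * (words / sentences)
                 - 84.6 * (syllables / words))
        if self.normalise:
            # Keep the score on the nominal 0 to 100 scale.
            score = max(0.0, min(100.0, score))
        return round(score, 1)


# Counts for a text of 12 one-word, one-syllable sentences.
print(ReadabilityScorer().flesch_kincaid_reading_ease(12, 12, 12))                 # 100.0
print(ReadabilityScorer(normalise=False).flesch_kincaid_reading_ease(12, 12, 12))  # 121.2
```

With the flag off, callers who compare against raw reference values see the unclamped score.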

jrfnl commented 10 years ago

@DaveChild Excellent! Glad to hear my analysis helped.

DaveChild commented 10 years ago

Very much, thanks :).

I'm now working through that awful CMU list to see if I can add a few more rules to the syllable counter, or if it's going to require a lot of manual assignment of values.

jrfnl commented 10 years ago

You are a star! Would you like me to keep the test result branch online, or shall I pull it down?