DaveChild / Text-Statistics

Generate information about text including syllable counts and Flesch-Kincaid, Gunning-Fog, Coleman-Liau, SMOG and Automated Readability scores.
https://readable.com/
BSD 2-Clause "Simplified" License

SMOG calculation discrepancies #31

Open srdjan-stojkovic opened 9 years ago

srdjan-stojkovic commented 9 years ago

Hi,

Text: "June 23rd, 2015 How Cigna deal limits Anthem’s Blue Cross brand “When health plans operate using the Blue Cross and Blue Shield brand, they are generally limited to business in a specific state or region as part of a licensing agreement with their trade group, the Blue Cross and Blue Shield Association. So when Anthem (ANTM), a major operator of Blue Cross plans, made its $184-a-share offer for Cigna (CI) to grow both health insurance businesses, it created potential hurdles when it comes to Anthem’s valuable Blue Cross brands expanding."

On readability-score.com I'm getting a value of 15.2 for SMOG, but with $textStatistics->smogIndex($input) only 9.4. This is a big difference. Am I doing something wrong?

gburtini commented 8 years ago

https://travis-ci.org/DaveChild/Text-Statistics

It appears SMOG is broken. Can anyone confirm this?

jee7 commented 6 years ago

Yes. There seem to be many issues here.

  1. The SMOG value here is always "normalized" (i.e. clamped) to the range [0, 12]. With that enabled you can never get that 15.2, so I turned it off:
    public $normalise = false;
  2. The SMOG formula is implemented incorrectly. It takes the square root of the sum and then multiplies, but the order should be: square root, then multiplication, then addition.
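        // Order of operations: 1.043 * sqrt(polysyllables * (30 / sentences)) + 3.1291
        $score = Maths::bcCalc(
            Maths::bcCalc(
                Maths::bcCalc(
                    // polysyllables * (30 / sentences)
                    Maths::bcCalc(
                        Syllables::wordsWithThreeSyllables($strText, true, $this->strEncoding),
                        '*',
                        Maths::bcCalc(
                            30,
                            '/',
                            Text::sentenceCount($strText, $this->strEncoding)
                        )
                    ),
                    'sqrt',
                    0
                ),
                '*',
                1.043
            ),
            '+',
            3.1291
        );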
  3. When the input text is cleaned, it is passed through utf8_decode. Characters that have no ISO-8859-1 equivalent, such as the curly quotes in your text, get converted to "?" signs, and those are interpreted as sentence terminators. So in your example text there are 2 sentences, but the script finds 5.
    //$strText = utf8_decode($strText);
  4. I'm not sure, but I also removed all the words that contain numbers. I dunno. It didn't make sense to me to count "23rd" or "$184-a-share" as words.
    $strText = preg_replace('/([^\.\s]*[0-9][^\.\s]*)/', '', $strText); // Remove words with numbers
    $strText = preg_replace('/\'/', '', $strText); // Remove ' symbol, dunno if helps.
    $strText = preg_replace('`  `', ' ', $strText); // Remove double spaces (because for some reason you calculate words based on number of spaces)
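To illustrate point 3, here is a minimal sketch of how converting UTF-8 to ISO-8859-1 can inflate a terminator count. countTerminators() is a hypothetical stand-in and not the library's sentence counter, and the sketch assumes the mbstring extension with its default substitute character ("?"). In the full example text there are three curly quotes and two periods, which would be consistent with the 5 sentences reported.

```php
<?php
// Hypothetical helper: counts runs of '.', '!' or '?' as one terminator each.
// This is NOT how Text::sentenceCount is implemented; it only shows the effect.
function countTerminators(string $text): int
{
    return (int) preg_match_all('/[.!?]+/', $text);
}

// Two sentences containing curly quotes (U+2019 and U+201C), as in the example text.
$utf8 = "Anthem\u{2019}s brand \u{201C}When plans operate. It created hurdles.";

// Convert UTF-8 to ISO-8859-1. Characters with no Latin-1 equivalent are
// replaced by mbstring's substitute character, which is "?" by default.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo countTerminators($utf8), PHP_EOL;   // 2
echo countTerminators($latin1), PHP_EOL; // 4
```

Skipping the conversion, as in the commented-out line above, leaves the quotes untouched and keeps the count at 2.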

Now, I don't have an account on readability-score.com, but I tried with other online calculators:

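| Metric        | Online-Utility | LearningAndWork | StoryToolz | Current TS | Improved TS |
|---------------|----------------|-----------------|------------|------------|-------------|
| characters    | 437            | -               | 436        | 427        | 425         |
| words         | 94             | 92              | 94         | 94         | 92          |
| poly-words    | -              | 14              | -          | 13         | 13          |
| sentences     | 2              | 2               | 2          | 5          | 2           |
| syl. per word | 1.48           | -               | 1.38       | 1.44       | 1.46        |
| ARI           | 23.97          | -               | 23.9       | 9.4        | 23.3        |
| Gunning-F     | 23.06          | -               | 23.9       | 12.6       | 23.6        |
| Flesch-K      | 20.19          | -               | 19.1       | 8.7        | 19.5        |
| Coleman-L     | 10.94          | -               | 10.8       | 10.9       | 11.4        |
| SMOG          | 16.96          | 23.2            | 16.4       | 9.4        | 17.7        |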
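The two TS SMOG values in this table can be reproduced from the counts in it. The sketch below uses plain floats rather than the library's Maths::bcCalc, and all three function names are hypothetical. It assumes "Current TS" adds the constant inside the square root (as described in point 2) with 5 detected sentences, and "Improved TS" uses the corrected order with 2 sentences.

```php
<?php
// Hypothetical helpers for illustration; not part of Text-Statistics.

/** Corrected order: square root, then multiply, then add. */
function smogCorrected(int $polysyllables, int $sentences): float
{
    return 1.043 * sqrt($polysyllables * 30 / $sentences) + 3.1291;
}

/** The order described as wrong: the constant is added inside the root. */
function smogSumInsideRoot(int $polysyllables, int $sentences): float
{
    return 1.043 * sqrt($polysyllables * 30 / $sentences + 3.1291);
}

/** Clamp a score to [0, 12], mimicking the normalisation from point 1. */
function clampScore(float $score, float $min = 0.0, float $max = 12.0): float
{
    return max($min, min($max, $score));
}

echo round(smogSumInsideRoot(13, 5), 1), PHP_EOL; // 9.4  (Current TS)
echo round(smogCorrected(13, 2), 1), PHP_EOL;     // 17.7 (Improved TS)
echo clampScore(smogCorrected(13, 2)), PHP_EOL;   // 12   (what normalisation would return)
```

The last line shows why normalisation has to be off for the corrected formula to be useful here: any score above 12 would be reported as 12.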

I also tried with my own test text, which is a bit longer.

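| Metric        | Online-Utility | LearningAndWork | StoryToolz | Current TS | Improved TS |
|---------------|----------------|-----------------|------------|------------|-------------|
| characters    | 2919           | -               | 2924       | 2899       | 2890        |
| words         | 604            | 604             | 592        | 604        | 585         |
| poly-words    | -              | 90              | -          | 108        | 108         |
| sentences     | 32             | 32              | 32         | 32         | 32          |
| syl. per word | 1.65           | -               | 1.58       | 1.64       | 1.68        |
| ARI           | 10.77          | -               | 11.1       | 10.6       | 11          |
| Gunning-F     | 12.32          | -               | 13.4       | 14         | 13.9        |
| Flesch-K      | 11.2           | -               | 10.3       | 11.2       | 11.4        |
| Coleman-L     | 11.08          | -               | 11.6       | 12         | 13.3        |
| SMOG          | 16.96 → 12.8   | 17.7            | 12.1       | 10.7       | 13.6        |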
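As a quick check on the SMOG row for this longer text, here is the arithmetic worked by hand. It assumes the constants from the code in point 2 and takes 108 poly-words and 32 sentences from the TS columns. The first line is the corrected order, the second has the constant inside the root.

```latex
\begin{align*}
1.043\sqrt{108 \times \tfrac{30}{32}} + 3.1291
  &= 1.043\sqrt{101.25} + 3.1291 \approx 13.6 \\
1.043\sqrt{108 \times \tfrac{30}{32} + 3.1291}
  &= 1.043\sqrt{104.3791} \approx 10.7
\end{align*}
```

These agree with the 13.6 (Improved TS) and 10.7 (Current TS) values. Since the sentence count is 32 in both columns, the difference for this text comes from the order of operations alone.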

But, yeah, there still seem to be problems. For example, the Coleman-Liau index has now gone up compared to the other calculators.