SMOG calculation discrepancies

srdjan-stojkovic commented 9 years ago

Hi,

Text: "June 23rd, 2015 How Cigna deal limits Anthem’s Blue Cross brand “When health plans operate using the Blue Cross and Blue Shield brand, they are generally limited to business in a specific state or region as part of a licensing agreement with their trade group, the Blue Cross and Blue Shield Association. So when Anthem (ANTM), a major operator of Blue Cross plans, made its $184-a-share offer for Cigna (CI) to grow both health insurance businesses, it created potential hurdles when it comes to Anthem’s valuable Blue Cross brands expanding."

On readability-score.com I'm getting a value of 15.2 for SMOG, but `$textStatistics->smogIndex($input)` returns only 9.4. That is a big difference. Am I doing something wrong?

gburtini commented 8 years ago

https://travis-ci.org/DaveChild/Text-Statistics

It appears SMOG is broken. Can anyone confirm this?

jee7 commented 6 years ago

Yes. There seem to be many issues here.

The SMOG value here is always "normalized" (i.e. clamped) to the range [0, 12]. With that enabled you can never get 15.2, so I turned it off:
```
public $normalise = false;
```
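To illustrate why a clamped score can never reach 15.2, here is a minimal sketch of clamping to [0, 12]. The `clampScore` helper is hypothetical and is not the library's own normalisation code.

```php
<?php
// Hypothetical helper (not part of Text-Statistics): clamp a score to [min, max].
function clampScore(float $score, float $min = 0.0, float $max = 12.0): float
{
    return max($min, min($max, $score));
}

echo clampScore(15.2), PHP_EOL; // 12  (anything above 12 is cut off)
echo clampScore(9.4), PHP_EOL;  // 9.4 (values inside the range pass through)
```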

The SMOG formula is implemented incorrectly. It takes the square root of the sum and only then multiplies, but the order should be: square root, then multiplication, then addition. With the order fixed:

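    Maths::bcCalc( // 1.043 * sqrt(...) + 3.1291
        Maths::bcCalc( // 1.043 * sqrt(...)
            Maths::bcCalc( // sqrt(polysyllables * 30 / sentences)
                Maths::bcCalc( // polysyllables * (30 / sentences)
                    Syllables::wordsWithThreeSyllables($strText, true, $this->strEncoding),
                    '*',
                    Maths::bcCalc(
                        30,
                        '/',
                        Text::sentenceCount($strText, $this->strEncoding)
                    )
                ),
                'sqrt',
                0
            ),
            '*',
            1.043
        ),
        '+',
        3.1291
    );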
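As a sanity check, the two orderings can be written with plain floats. This is only a sketch: the function names are hypothetical and it does not use the library's `Maths::bcCalc`. The inputs are the counts from the first comparison table further down (13 polysyllabic words; 2 sentences after the fix, 5 before).

```php
<?php
// Plain-float sketch; function names are hypothetical, not library API.

// Square root first, then multiply, then add:
// 1.043 * sqrt(polysyllables * 30 / sentences) + 3.1291
function smogCorrect(int $polysyllables, int $sentences): float
{
    return 1.043 * sqrt($polysyllables * 30 / $sentences) + 3.1291;
}

// The ordering described above as wrong: the constant is added
// inside the square root, and the multiplication comes last.
function smogSumInsideRoot(int $polysyllables, int $sentences): float
{
    return 1.043 * sqrt($polysyllables * 30 / $sentences + 3.1291);
}

printf("%.1f\n", smogCorrect(13, 2));       // 17.7
printf("%.1f\n", smogSumInsideRoot(13, 5)); // 9.4
```

Both results match the "Improved TS" and "Current TS" SMOG rows for the first text, which suggests the formula order and the sentence count together account for the gap.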

When the input text is cleaned, it is passed through `utf8_decode()`. However, characters that cannot be represented in ISO-8859-1 (for example the curly quotes in your text) get converted to "?" signs, and those are then interpreted as sentence terminators. So your example text contains 2 sentences, but the script finds 5. I commented that line out:
```
//$strText = utf8_decode($strText);
```
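The effect is easy to reproduce in isolation. In this sketch `countTerminators` is a hypothetical stand-in for a naive sentence splitter, not the library's `Text::sentenceCount`, and the sample sentence is made up.

```php
<?php
// Hypothetical helper: counts characters a naive splitter would treat as sentence ends.
function countTerminators(string $text): int
{
    return (int) preg_match_all('/[.!?]/', $text);
}

// Two real sentences, with typographic quotes written as UTF-8 escapes.
$text = "Anthem\u{2019}s \u{201C}Blue Cross\u{201D} brand is valuable. It has limits.";

// utf8_decode() is deprecated since PHP 8.2; "@" only silences that notice here.
$decoded = @utf8_decode($text);

echo countTerminators($text), PHP_EOL;    // 2
echo countTerminators($decoded), PHP_EOL; // 5 (each curly quote became "?")
```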

I'm not sure, but I also removed all the words that contain numbers. I dunno. It didn't make sense to me to count "23rd" or "$184-a-share" as words.

    $strText = preg_replace('/([^\.\s]*[0-9][^\.\s]*)/', '', $strText); // Remove words with numbers
    $strText = preg_replace('/\'/', '', $strText); // Remove ' symbol, dunno if it helps
    $strText = preg_replace('/ {2,}/', ' ', $strText); // Collapse runs of spaces (words are counted based on the number of spaces)
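Here is a small self-contained sketch of that cleanup applied to a fragment of the example text. The wrapper function `removeNumberWords` is hypothetical, and it collapses whole runs of spaces (`/ {2,}/`) instead of only pairs, which is my assumption about the intended behaviour: removing two adjacent number-words leaves three spaces behind.

```php
<?php
// Hypothetical wrapper around the cleanup steps; not part of Text-Statistics.
function removeNumberWords(string $text): string
{
    // Drop every whitespace-delimited token that contains a digit.
    $text = preg_replace('/([^\.\s]*[0-9][^\.\s]*)/', '', $text);
    // Drop apostrophes.
    $text = preg_replace('/\'/', '', $text);
    // Collapse the runs of spaces left behind by the removed tokens.
    return preg_replace('/ {2,}/', ' ', $text);
}

echo removeNumberWords('June 23rd, 2015 How Cigna made its $184-a-share offer.'), PHP_EOL;
// June How Cigna made its offer.
```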

Now, I don't have an account on readability-score.com, but I tried with other online calculators:

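| Metric | Online-Utility | LearningAndWork | StoryToolz | Current TS | Improved TS |
|---|---|---|---|---|---|
| characters | 437 | - | 436 | 427 | 425 |
| words | 94 | 92 | 94 | 94 | 92 |
| poly-words | - | 14 | - | 13 | 13 |
| sentences | 2 | 2 | 2 | 5 | 2 |
| syl. per word | 1.48 | - | 1.38 | 1.44 | 1.46 |
| ARI | 23.97 | - | 23.9 | 9.4 | 23.3 |
| Gunning-F | 23.06 | - | 23.9 | 12.6 | 23.6 |
| Flesch-K | 20.19 | - | 19.1 | 8.7 | 19.5 |
| Coleman-L | 10.94 | - | 10.8 | 10.9 | 11.4 |
| SMOG | 16.96 | 23.2 | 16.4 | 9.4 | 17.7 |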

I also tried with my own test text, which is a bit longer.

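| Metric | Online-Utility | LearningAndWork | StoryToolz | Current TS | Improved TS |
|---|---|---|---|---|---|
| characters | 2919 | - | 2924 | 2899 | 2890 |
| words | 604 | 604 | 592 | 604 | 585 |
| poly-words | - | 90 | - | 108 | 108 |
| sentences | 32 | 32 | 32 | 32 | 32 |
| syl. per word | 1.65 | - | 1.58 | 1.64 | 1.68 |
| ARI | 10.77 | - | 11.1 | 10.6 | 11 |
| Gunning-F | 12.32 | - | 13.4 | 14 | 13.9 |
| Flesch-K | 11.2 | - | 10.3 | 11.2 | 11.4 |
| Coleman-L | 11.08 | - | 11.6 | 12 | 13.3 |
| SMOG | 12.8 | 17.7 | 12.1 | 10.7 | 13.6 |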

But, yeah, there still seem to be problems. For example, the Coleman-Liau index is now higher than in the other calculators.

SMOG calculation discrepancies #31