cadmiumcr / cadmium

Natural Language Processing (NLP) library for Crystal
https://cadmiumcr.com
MIT License

Negative values for readability scores #14

Closed: rmarronnier closed this issue 5 years ago

rmarronnier commented 5 years ago
test = "*-/ /*/"
test_readability = Cadmium.readability.new(test).fog # or flesch or kincaid
puts test_readability

Outputs

-NaN

For some longer texts (which I can provide if you need them) I get strongly negative values (e.g. -620).

Is this the expected behavior? Shouldn't the value be set to 0?

hugoabonizio commented 5 years ago

The #flesch method is defined as:

https://github.com/watzon/cadmium/blob/589fcf60b24280ff0b6666ba9df54f212ebca9b0/src/cadmium/readability.cr#L92-L94

Thus, depending on the variables (e.g. 6 words, 15 syllables and 1 sentence), the result would be -10.754999999999995. You can read more about this here.
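To make the arithmetic explicit: assuming the linked lines implement the standard Flesch Reading Ease formula (the source itself is not quoted in this thread, so this is an assumption), the example of 6 words, 15 syllables and 1 sentence works out as follows.

$$
\begin{aligned}
\text{score} &= 206.835 - 1.015 \cdot \frac{\text{words}}{\text{sentences}} - 84.6 \cdot \frac{\text{syllables}}{\text{words}} \\
&= 206.835 - 1.015 \cdot \frac{6}{1} - 84.6 \cdot \frac{15}{6} \\
&= 206.835 - 6.09 - 211.5 \\
&= -10.755
\end{aligned}
$$

The trailing digits in -10.754999999999995 come from floating-point rounding. The formula has no lower bound, so a high syllables-per-word or words-per-sentence ratio pushes the score below zero. If a text presumably yields zero words or zero sentences, the ratios become 0/0, which evaluates to NaN in floating-point arithmetic.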

rmarronnier commented 5 years ago

Thanks @hugoabonizio for the Siteimprove page you linked to!

So it's the developer's responsibility to "clean up" the scores they get. OK.

@watzon: Would you be interested in PRs implementing other readability tests listed here? Examples:

watzon commented 5 years ago

Hmm, we should probably clamp the score at 0 so it never goes lower. I would definitely be interested in a PR for other types of readability scores as well @rmarronnier
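A minimal sketch of what such a lower bound could look like on the caller's side. `clean_score` and its `floor` parameter are hypothetical names used for illustration only; they are not part of Cadmium, and this is not necessarily how the library would implement it.

```crystal
# Hypothetical helper, NOT part of Cadmium: post-process a raw readability score.
def clean_score(raw : Float64, floor : Float64 = 0.0) : Float64
  # NaN shows up when a ratio in the formula is 0.0 / 0.0
  # (e.g. a text with no recognizable words).
  return floor if raw.nan?

  # Never let the score drop below the chosen floor.
  Math.max(raw, floor)
end

puts clean_score(-10.754999999999995)    # => 0.0
puts clean_score(Float64::NAN)           # => 0.0
puts clean_score(65.2)                   # => 65.2
puts clean_score(-620.0, floor: -1000.0) # => -620.0
```

Keeping the floor configurable leaves room for callers who want to distinguish "very hard to read" from "unscorable" instead of collapsing both to 0.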

rmarronnier commented 5 years ago

I'll let you decide about capping scores. But FYI, I ran the Cadmium Flesch test on 2000+ GPT-generated texts. The lowest scores were around -600 (fewer than a dozen texts, some Eminem-like lyrics without punctuation). The "garbled" ones (random successions of special characters) got -NaN. So a lower bound of -1000 or -2000 should be pretty safe IMO.

OK, one PR per test, let's roll! ;-)