Wrong formula for the Coleman–Liau index

brucewlee / lingfeat

[EMNLP 2021] LingFeat - A Comprehensive Linguistic Features Extraction ToolKit for Readability Assessment

Creative Commons Attribution Share Alike 4.0 International


Wrong formula for the Coleman–Liau index #6

Open kduxin opened 1 year ago

kduxin commented 1 year ago

Hi Lee,

There are two errors in the formula.

First, the original Coleman-Liau index counts the number of letters per 100 words, whereas the code counts the number of tokens per 100 words.

Second, dividing the count by 100 is the wrong way to account for "per 100 words". It should instead be $$n_{letters} / (n_{tokens} / 100)$$

As a result, the produced score is always around -15.0.
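
For reference, this is the index as it is usually stated, writing $L$ for letters per 100 words and $S$ for sentences per 100 words ($n_{sentences}$ is a symbol introduced only for this note, not a name from the code):

$$CLI = 0.0588 L - 0.296 S - 15.8, \quad L = n_{letters} / (n_{tokens} / 100), \quad S = n_{sentences} / (n_{tokens} / 100)$$

If both $L$ and $S$ come out far too small, the first two terms nearly vanish and the score stays close to the constant $-15.8$, which would be consistent with the values observed here.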

My installed lingfeat version is 1.00b19. Has this been fixed in a later version?

brucewlee commented 1 year ago

I see. You are correct.

Thank you @kduxin. I'll make the appropriate changes.

MarioGalindoQ commented 1 year ago

Hi Bruce, I discovered the same bug. In the file TraF.py at line 84, I think that the right code is:

`result = 0.0588 * (self.n_char / self.n_token * 100) - 0.296 * (self.n_sent * 100 / self.n_token) - 15.8`

Hope this will help you. Thanks
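
The proposed fix can be checked outside the library with a small standalone function. This is only a sketch of the standard Coleman-Liau formula, not lingfeat's actual code: the function name `coleman_liau_index` and its parameters are hypothetical, and it assumes the caller already has counts of letters, words, and sentences.

```python
def coleman_liau_index(n_letters: int, n_words: int, n_sentences: int) -> float:
    """Coleman-Liau index from raw counts (hypothetical helper, not part of lingfeat)."""
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    # L: average number of letters per 100 words
    letters_per_100 = n_letters / n_words * 100
    # S: average number of sentences per 100 words
    sentences_per_100 = n_sentences / n_words * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


# Example: 100 words, 500 letters, 5 sentences gives a score of about 12.12
print(round(coleman_liau_index(500, 100, 5), 2))
```

Comparing this value against the library's output for the same counts should show whether the scaling is right. A result stuck near -15.8 suggests the per-100-words terms are still being computed too small.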

brucewlee commented 1 year ago

Thank you Professor Queralt for the suggestions, including the one at #4. I am planning to restructure this project and release a better version. Although my job has kept me busy since I released this library, I sincerely appreciate the continued attention.

brucewlee commented 1 year ago

The new update will likely be in November.