brucewlee / lingfeat

[EMNLP 2021] LingFeat - A Comprehensive Linguistic Features Extraction ToolKit for Readability Assessment
Creative Commons Attribution Share Alike 4.0 International

Duplicate feature names in OSKF and WBKF #4

Open iris2hu opened 2 years ago

iris2hu commented 2 years ago

Hello, thanks for this great project!

We have recently been trying to reproduce the experimental results in your paper:

Lee, Bruce W., Yoo Sung Jang, and Jason Lee. "Pushing on Text Readability Assessment: A Transformer Meets Handcrafted Linguistic Features." Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021.

We found that the OSKF method in lingfeat returns exactly the same 16 feature names as WBKF. Please see the example below:

from lingfeat import extractor

text = "When you see the word Amazon, what’s the first thing that springs to mind – the world’s biggest forest, the longest river or the largest internet retailer – and which do you consider most important?"
LingFeat = extractor.pass_text(text)
LingFeat.preprocess()

WBKF = LingFeat.WBKF_() # WeeBit Corpus Knowledge Features
OSKF = LingFeat.OSKF_() # OneStopEng Corpus Knowledge Features

print('WeeBit Corpus Knowledge Features:', WBKF)
print('OneStopEng Corpus Knowledge Features:', OSKF)

Terminal Output

WeeBit Corpus Knowledge Features:  {'BRich05_S': 1.1274421401321888, 'BRich10_S': 4.858168950304389, 'BRich15_S': 20.647890945896506, 'BRich20_S': 21.932124523445964, 'BClar05_S': 0.5823907653490702, 'BClar10_S': 0.718731752038002, 'BClar15_S': 0.7291195740302404, 'BClar20_S': 0.7486800486626832, 'BNois05_S': 1.5104791224775047, 'BNois10_S': 6.548753840448406, 'BNois15_S': 7.018329580783902, 'BNois20_S': 8.321480132061497, 'BTopc05_S': 3, 'BTopc10_S': 10, 'BTopc15_S': 18, 'BTopc20_S': 23}
OneStopEng Corpus Knowledge Features:  {'BRich05_S': 2.9044833183288574, 'BRich10_S': 3.5476092249155045, 'BRich15_S': 9.398028403520584, 'BRich20_S': 14.846967313438654, 'BClar05_S': 0.00015333294868469238, 'BClar10_S': 0.25143229961395264, 'BClar15_S': 0.6553432226181031, 'BClar20_S': 0.7100768367449443, 'BNois05_S': 1.0000004289882432, 'BNois10_S': 1.4495860709293316, 'BNois15_S': 4.214530509499038, 'BNois20_S': 5.500046277858743, 'BTopc05_S': 2, 'BTopc10_S': 3, 'BTopc15_S': 10, 'BTopc20_S': 15}

According to Appendix B of the above paper, the feature names in OSKF should start with 'O', e.g. 'ORich05_S', 'ORich10_S', etc.

This bug yields 239 distinct feature names instead of the 255 features described in the paper (255 - 16 = 239). The same gap appears in another open-source project for this paper:

https://github.com/brucewlee/pushingonreadability_traditional_ML

The CSV files in Research_Data include only 239 linguistic features, which we believe is caused by these duplicate feature names.
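To illustrate how duplicate names could shrink the feature count, here is a minimal sketch. It assumes the per-group feature dictionaries are merged by key into one row before being written out; we have not confirmed that this is how the Research_Data files were actually built. The helper name `merge_feature_dicts` and the toy values are ours, not part of lingfeat.

```python
def merge_feature_dicts(*feature_dicts):
    """Merge several feature dicts into one; later dicts overwrite equal keys."""
    merged = {}
    for features in feature_dicts:
        merged.update(features)
    return merged


# Toy values only: two features per group, sharing the same names.
wbkf = {"BRich05_S": 1.13, "BClar05_S": 0.58}
oskf = {"BRich05_S": 2.90, "BClar05_S": 0.00015}

merged = merge_feature_dicts(wbkf, oskf)

# Four values went in, but only two names survive the merge,
# and the surviving values are the ones from the later dict.
print(len(wbkf) + len(oskf), "values in,", len(merged), "names out")
print(merged)
```

With all 16 names colliding, the same effect would turn 255 expected features into 239.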

brucewlee commented 2 years ago

Hi. I sincerely apologize for my late reply and thank you for your interest.

I'm a little busy with EMNLP 2022. I will fix the mistake you pointed out in mid-June.

If you need any other help in reproducing the results, please email me so I can help!

Thanks :)

MarioGalindoQ commented 1 year ago

Hi Bruce, the fix for this bug is simple. In the file _AdvancedSemantic/OSKF.py, from line 90 onward, replace "BRich" with "ORich", "BClar" with "OClar", "BNois" with "ONois", and "BTopc" with "OTopc". Obviously you know this, but I am writing out the solution to help others. Thank you.
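For anyone who cannot edit the installed package, a possible workaround is to rename the keys after extraction. This is only a sketch: `rename_oskf_keys` is a hypothetical helper, not part of lingfeat, and it assumes OSKF_() returns a dict with the key names shown in the terminal output above.

```python
# Prefixes that OSKF currently shares with WBKF (taken from the output above).
SHARED_PREFIXES = ("BRich", "BClar", "BNois", "BTopc")


def rename_oskf_keys(oskf_features):
    """Return a copy of the OSKF dict with the leading 'B' replaced by 'O'."""
    renamed = {}
    for name, value in oskf_features.items():
        if name.startswith(SHARED_PREFIXES):
            # Swap only the first character; the rest of the name is unchanged.
            name = "O" + name[1:]
        renamed[name] = value
    return renamed


# Toy input using two of the key names from the terminal output.
example = {"BRich05_S": 2.9, "BTopc20_S": 15}
print(rename_oskf_keys(example))
```

In the original example this would be applied as `OSKF = rename_oskf_keys(LingFeat.OSKF_())` before combining OSKF with WBKF.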