cdimascio / py-readability-metrics

📗 Score text readability using a number of formulas: Flesch-Kincaid Grade Level, Gunning Fog, ARI, Dale Chall, SMOG, and more
https://py-readability-metrics.readthedocs.io/en/latest/
MIT License

Question: Why NLTK TweetTokenizer? #26

Open freecraver opened 2 years ago

freecraver commented 2 years ago

Thanks for your work on this nice project.

I intend to create a library for text simplification and would potentially like to integrate your package. The choice of tokenizer affects the resulting readability scores, and I was wondering how you approached this issue.

Was there a specific reason for choosing the TweetTokenizer over, e.g., the default/recommended NLTK tokenizer, which more closely follows the Penn Treebank's definition of word boundaries? https://github.com/cdimascio/py-readability-metrics/blob/3ffb97f6057ae2451599d083a69ece78a61a6fa4/readability/text/analyzer.py#L128
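
To illustrate why the tokenizer matters, here is a minimal sketch using only the standard library. The two tokenizers are hypothetical stand-ins, not NLTK's implementations: one keeps contractions as a single token, the other splits them in a Treebank-like way ("don't" becomes "do" + "n't"). The `ari` helper applies the standard Automated Readability Index formula with a deliberately naive sentence count; it is an assumption for illustration and not how this package computes the score.

```python
import re


def tokenize_keep_contractions(text):
    """Hypothetical tokenizer: keeps contractions such as "don't" as one token."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)


def tokenize_split_contractions(text):
    """Hypothetical tokenizer: splits contractions Treebank-style ("do" + "n't")."""
    tokens = []
    for word in tokenize_keep_contractions(text):
        # The lazy first group lets "n't" win over a bare "'t" suffix.
        match = re.fullmatch(r"([A-Za-z]+?)(n't|'[A-Za-z]+)", word)
        if match:
            tokens.extend(match.groups())
        else:
            tokens.append(word)
    return tokens


def ari(text, tokenize):
    """Automated Readability Index with a naive sentence splitter (illustrative only)."""
    words = tokenize(text)
    # Count only letters and digits, so both tokenizers see the same characters.
    chars = sum(ch.isalnum() for token in words for ch in token)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return 4.71 * chars / len(words) + 0.5 * len(words) / sentences - 21.43


if __name__ == "__main__":
    sample = "I don't think it's hard. We can't stop now."
    for tokenizer in (tokenize_keep_contractions, tokenize_split_contractions):
        words = tokenizer(sample)
        print(tokenizer.__name__, len(words), round(ari(sample, tokenizer), 2))
```

The text is identical in both runs, but the word count differs, so the characters-per-word and words-per-sentence terms shift and the score changes with them.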

cdimascio commented 2 years ago

@freecraver I'm open to changing the tokenizer. Would you be interested in investigating the effort to switch over?

freecraver commented 2 years ago

Sure - please check https://github.com/cdimascio/py-readability-metrics/pull/27 for my suggested changes.