cdimascio / py-readability-metrics

📗 Score text readability using a number of formulas: Flesch-Kincaid Grade Level, Gunning Fog, ARI, Dale Chall, SMOG, and more
https://py-readability-metrics.readthedocs.io/en/latest/
MIT License

Allow word tokenization override #27

Open freecraver opened 2 years ago

freecraver commented 2 years ago

See https://github.com/cdimascio/py-readability-metrics/issues/26#issuecomment-1046301510

Introduces a non-breaking change that allows overriding the default word-level tokenization with a custom function.

The new f_tokenize_words argument accepts a function that maps a text to its list of words.

Example:

from readability import Readability
from nltk import word_tokenize

r = Readability(text, f_tokenize_words=word_tokenize)
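
A custom tokenizer can be any callable that takes the full text and returns a list of word strings. A minimal regex-based sketch (the name `simple_word_tokenize` and the pattern are illustrative, not part of this library):

```python
import re

def simple_word_tokenize(text):
    # Keep contractions ("We've") and dotted abbreviations ("U.S") as single
    # words by allowing apostrophes/periods between word characters.
    return re.findall(r"\w+(?:['.]\w+)*", text)

print(simple_word_tokenize("We've got two different solutions"))
# ["We've", 'got', 'two', 'different', 'solutions']
```

Note that this pattern drops the trailing period of "U.S." and all other punctuation; a real tokenizer needs more care, which is exactly why the override accepts an arbitrary function.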

Tests run ✔️
Tests added ✔️
Added section 'What makes a word' to the README ✔️

Additional remarks:

The choice of tokenizer matters. For comparison, here is how a Tweet-style tokenizer (NLTK's TweetTokenizer) and the Treebank-style word_tokenize split the same text:

| Text | Tweet | Treebank |
| --- | --- | --- |
| "We've got two different solutions" | `["We've", 'got', 'two', 'different', 'solutions']` | `['We', "'ve", 'got', 'two', 'different', 'solutions']` |
| 'How common are abbreviations in the U.S.?' | `['How', 'common', 'are', 'abbreviations', 'in', 'the', 'U', '.', 'S', '.', '?']` | `['How', 'common', 'are', 'abbreviations', 'in', 'the', 'U.S.', '?']` |
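
The "Tweet" column above can be reproduced with NLTK's TweetTokenizer, which needs no extra corpus downloads, and its tokenize method is itself a valid value for f_tokenize_words (a sketch, assuming nltk is installed):

```python
from nltk.tokenize import TweetTokenizer

# TweetTokenizer keeps contractions intact but splits dotted abbreviations.
tweet_tokenize = TweetTokenizer().tokenize
print(tweet_tokenize("We've got two different solutions"))
# ["We've", 'got', 'two', 'different', 'solutions']

# The same callable plugs into the new argument:
# r = Readability(text, f_tokenize_words=tweet_tokenize)
```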