Open freecraver opened 2 years ago
See https://github.com/cdimascio/py-readability-metrics/issues/26#issuecomment-1046301510
Introduces a non-breaking change which allows overriding the default word-level tokenization. The new `f_tokenize_words` argument accepts a function that maps a text to its words.

Example:

```python
from nltk import word_tokenize

r = Readability(text, f_tokenize_words=word_tokenize)
```
- Tests run ✔️
- Tests added ✔️
- Added section 'What makes a word' to Readme ✔️
Additional remarks:

A difference between the `TweetTokenizer` and the `TreebankWordTokenizer` I observed is the handling of clitics and abbreviations:

`"We've got two different solutions"`
- `TweetTokenizer`: `["We've", 'got', 'two', 'different', 'solutions']`
- `TreebankWordTokenizer`: `['We', "'ve", 'got', 'two', 'different', 'solutions']`

`'How common are abbreviations in the U.S.?'`
- `TweetTokenizer`: `['How', 'common', 'are', 'abbreviations', 'in', 'the', 'U', '.', 'S', '.', '?']`
- `TreebankWordTokenizer`: `['How', 'common', 'are', 'abbreviations', 'in', 'the', 'U.S.', '?']`
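Since `f_tokenize_words` accepts any `text -> list[str]` function, users aren't limited to NLTK. As a hypothetical illustration (not part of this PR), here is a minimal stdlib-only tokenizer that keeps both clitics and dotted abbreviations intact, sketching the kind of custom behavior the new argument enables:

```python
import re

def simple_word_tokenize(text):
    """Hypothetical custom tokenizer for f_tokenize_words.

    Keeps clitics ("We've") and dotted abbreviations ("U.S.") as
    single tokens; any other non-space character becomes its own token.
    The abbreviation pattern comes first so "U.S." is tried before
    the plain-word pattern can consume the bare "U".
    """
    return re.findall(r"(?:[A-Za-z]\.)+|[A-Za-z]+(?:'[A-Za-z]+)?|\S", text)

print(simple_word_tokenize("We've got two different solutions"))
print(simple_word_tokenize('How common are abbreviations in the U.S.?'))
```

It would be wired in the same way as the NLTK example, e.g. `Readability(text, f_tokenize_words=simple_word_tokenize)`.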