chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

Readability stats use wrong word count due to stop list usage #7

Closed henningko closed 8 years ago

henningko commented 8 years ago

Great work—stumbled across this while writing my own Python script for readability stats. Looking forward to Topic Modeling :)

Comparing our two readability implementations, I noticed a discrepancy in word count: yours counts far fewer words.

Turns out that for calculating the readability stats in textacy.text_stats, you use the following line:

 words = doc.words(filter_punct=True)

which probably should be:

 words = doc.words(filter_punct=True, filter_stops=False)

Because stop-word filtering defaults to filter_stops=True in textacy.extract.words, the number of words considered for the readability scores is reduced significantly, which renders the scores incorrect. Dropping stop words is a rather significant change to any text, so maybe the default should be False?
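To illustrate the size of the effect, here is a minimal standalone sketch that does not use textacy at all. The tokenizer, the tiny stop list, and the function names are all hypothetical stand-ins, and the Automated Readability Index is used only because it needs nothing beyond character, word, and sentence counts.

```python
import re

# Hypothetical, deliberately tiny stop list for illustration only;
# it is NOT the stop list that spaCy or textacy uses.
STOP_WORDS = {"the", "a", "an", "of", "on", "and", "is", "it", "was", "to", "in"}


def tokenize_words(text, filter_stops=False):
    """Return word tokens without punctuation, optionally dropping stop words."""
    words = re.findall(r"[A-Za-z']+", text)
    if filter_stops:
        words = [w for w in words if w.lower() not in STOP_WORDS]
    return words


def automated_readability_index(text, filter_stops=False):
    """ARI = 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43."""
    words = tokenize_words(text, filter_stops=filter_stops)
    # Naive sentence count: runs of terminal punctuation.
    n_sents = len(re.findall(r"[.!?]+", text))
    if not words or n_sents == 0:
        raise ValueError("text needs at least one word and one sentence")
    n_chars = sum(len(w) for w in words)
    return 4.71 * (n_chars / len(words)) + 0.5 * (len(words) / n_sents) - 21.43


text = "The cat sat on the mat. It was a sunny day in the garden."

all_words = tokenize_words(text)
content_words = tokenize_words(text, filter_stops=True)
print(len(all_words), len(content_words))  # 14 words vs. 6 after filtering

ari_all = automated_readability_index(text)
ari_filtered = automated_readability_index(text, filter_stops=True)
print(ari_all, ari_filtered)  # the two scores differ
```

In this toy sentence pair, more than half of the words disappear once stop words are filtered, so both the words-per-sentence and characters-per-word terms shift. Any readability formula built on those counts would be affected in the same way.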

bdewilde commented 8 years ago

@henningko Good catch! Thanks, will push the fix today (I'm also releasing the latest version, finally!). Sorry about the delay in responding — I don't get issue notifications emailed to me for some reason, so I only stumbled onto this now. Will keep a closer eye on this.