Description
moved existing text preprocessing functionality from a top-level preprocess module into a preprocessing sub-package, and reorganized it a bit
added new functionality:
replace_hashtags() to replace hashtags like #FollowFriday or #spacyIRL2019 with _TAG_
replace_user_handles() to replace user handles like @bjdewilde or @spacy_io with _USER_
normalize_hyphenated_words() to join hyphenated words back together, like antici- pation => anticipation
normalize_quotation_marks() to replace "fancy" quotation marks with simple ASCII equivalents, like “the god particle” => "the god particle"
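To make the intended behavior of the four new functions concrete, here is a minimal standalone sketch that reproduces the examples above. It is not the code in this PR: the regular expressions, the quote mapping, and the `replace_with` parameter name are assumptions chosen for illustration.

```python
import re

# Illustrative patterns only; the package's own regexes may differ.
RE_HASHTAG = re.compile(r"(?<!\w)#\w+")
RE_USER_HANDLE = re.compile(r"(?<!\w)@\w+")
# A word fragment, a hyphen, some whitespace, then another word fragment.
RE_HYPHENATED_WORD = re.compile(r"(\w{2,})-\s+(\w{2,})")
# Curly single and double quotes mapped to their plain ASCII equivalents.
QUOTE_TABLE = str.maketrans(
    {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}
)


def replace_hashtags(text, replace_with="_TAG_"):
    """Replace every hashtag in ``text`` with ``replace_with``."""
    return RE_HASHTAG.sub(replace_with, text)


def replace_user_handles(text, replace_with="_USER_"):
    """Replace every @-style user handle in ``text`` with ``replace_with``."""
    return RE_USER_HANDLE.sub(replace_with, text)


def normalize_hyphenated_words(text):
    """Join words that were split by a hyphen followed by whitespace."""
    return RE_HYPHENATED_WORD.sub(r"\1\2", text)


def normalize_quotation_marks(text):
    """Replace curly quotation marks with straight ASCII ones."""
    return text.translate(QUOTE_TABLE)


print(replace_hashtags("Happy #FollowFriday from #spacyIRL2019!"))
print(replace_user_handles("thanks @bjdewilde and @spacy_io"))
print(normalize_hyphenated_words("antici- pation"))
print(normalize_quotation_marks("\u201cthe god particle\u201d"))
```

The hyphenation pattern here is deliberately naive: it would also join genuinely hyphenated compounds that happen to be followed by whitespace, so treat it as a starting point rather than a faithful copy of the package's behavior.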
changed a couple of functions for clarity and consistency:
replace_currency_symbols() now replaces all dedicated ASCII and Unicode currency symbols with _CUR_, rather than just a subset of them, and no longer supports replacement with the corresponding currency code (like $ => USD)
remove_punct() now takes a fast (bool) kwarg rather than method (str), which is simpler and clarifies the difference between the two options
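The two changed functions can be sketched as follows. This is a hedged illustration, not the PR's code: detecting currency symbols through the Unicode `Sc` category, limiting punctuation to `string.punctuation`, and having `fast=True` select `str.translate` while `fast=False` selects a regex are all assumptions, since the description does not spell out what each option does internally.

```python
import re
import string
import unicodedata


def replace_currency_symbols(text, replace_with="_CUR_"):
    """Replace every character in Unicode category Sc (currency symbol)."""
    return "".join(
        replace_with if unicodedata.category(ch) == "Sc" else ch for ch in text
    )


# Assumption: both code paths cover ASCII punctuation only.
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})
RE_PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")


def remove_punct(text, fast=False):
    """Replace punctuation marks with spaces.

    A boolean flag replaces the older string-valued ``method`` argument:
    ``fast=True`` uses a translation table, ``fast=False`` uses a regex.
    """
    if fast:
        return text.translate(PUNCT_TABLE)
    return RE_PUNCT.sub(" ", text)


print(replace_currency_symbols("$5, \u20ac4 or \u00a5600"))
print(remove_punct("Hello, world!", fast=True))
```

In this sketch both branches of `remove_punct()` return the same string; the boolean only picks the implementation, which is what makes it easier to reason about than a free-form string argument.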
removed some bad/awkward functionality:
normalize_contractions(): this was a clunky, slow, and very limited attempt; better to use a separate, dedicated package
preprocess_text(): this was an awkward attempt at user convenience; better to let users mix and match their preprocessing pipeline as needed
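With preprocess_text() gone, one way users might assemble their own pipeline is to compose individual functions. The sketch below assumes a hypothetical `make_pipeline()` helper that is not part of this PR, and uses small stand-in steps so that it runs without the package installed.

```python
import re
from functools import reduce


def make_pipeline(*funcs):
    """Hypothetical helper: chain text -> text functions, left to right."""

    def pipeline(text):
        return reduce(lambda acc, func: func(acc), funcs, text)

    return pipeline


# Stand-in steps for illustration; swap in the package's functions as needed.
def replace_hashtags(text):
    return re.sub(r"(?<!\w)#\w+", "_TAG_", text)


def normalize_whitespace(text):
    return " ".join(text.split())


preprocess = make_pipeline(replace_hashtags, normalize_whitespace, str.lower)
print(preprocess("Big  news #FollowFriday "))
```

Because each step is a plain function from string to string, steps can be reordered, dropped, or replaced with third-party functions without any special support from the library.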
added more and better tests for all of the above
Motivation and Context
This part of the code base has acquired some cobwebs over the years, and Issue #250 reminded me that more work was required than a hotfix.
Types of changes
[x] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[x] Breaking change (fix or feature that would cause existing functionality to change)
Checklist:
[x] My code follows the code style of this project.
[x] My change requires a change to the documentation, and I have updated it accordingly.