Rename certain functions and variables for clarity and consistency with existing conventions
textacy.load_spacy() => textacy.load_spacy_lang()
textacy.extract.named_entities() => textacy.extract.entities(), with ne renamed to ent internally
textacy.data_dir => textacy.DEFAULT_DATA_DIR
filename => filepath and dirname => dirpath when specifying full paths to files/dirs on disk, and textacy.io.utils.get_filenames() => textacy.io.utils.get_filepaths()
names of compiled regular expressions now start with RE_ instead of ending with _RE, using REGEX, etc.
SpacyDoc => Doc, SpacySpan => Span, SpacyToken => Token, SpacyLang => Language, both as variable names and in the docs
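To help locate call sites affected by the renames above, here is a minimal sketch that scans source text for the old fully qualified names and reports their replacements. The mapping is copied from the list above; the helper `find_old_names()` and the constant names are hypothetical and not part of textacy. It only catches dotted `textacy.` references, so names pulled in via `from textacy import ...` would need a separate check.

```python
import re

# Old name -> new name, copied from the rename list in this PR.
RENAMES = {
    "textacy.load_spacy": "textacy.load_spacy_lang",
    "textacy.extract.named_entities": "textacy.extract.entities",
    "textacy.data_dir": "textacy.DEFAULT_DATA_DIR",
    "textacy.io.utils.get_filenames": "textacy.io.utils.get_filepaths",
}

# Named with the RE_ prefix, per the convention adopted here. The lookbehind
# and lookahead keep the pattern from matching inside a longer identifier, so
# an already-migrated textacy.load_spacy_lang is left alone.
RE_OLD_NAME = re.compile(
    r"(?<![A-Za-z0-9_.])(?:"
    + "|".join(re.escape(old) for old in sorted(RENAMES, key=len, reverse=True))
    + r")(?![A-Za-z0-9_])"
)


def find_old_names(source):
    """Return (line_number, old_name, new_name) for each outdated reference."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in RE_OLD_NAME.finditer(line):
            old = match.group(0)
            hits.append((lineno, old, RENAMES[old]))
    return hits
```

Running `find_old_names()` over each file's contents gives a list of places to update by hand; the sketch deliberately reports rather than rewrites.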
Remove some deprecated functionality, as planned
top-level spacy_utils.py and spacy_pipelines.py are gone; use the spacier subpackage instead
textacy.compat.bytes_to_unicode() and textacy.compat.unicode_to_bytes() are gone; use textacy.compat.to_unicode() and textacy.compat.to_bytes() instead
the ftfy dependency is dropped, and textacy's wrapper function, textacy.preprocess.fix_bad_unicode(), now raises a NotImplementedError. (Note: There wasn't any deprecation warning, but since the fix is to replace the call with an equivalent but more powerful call to ftfy.fix_text(), I opted to bundle this in with all these other changes. Sorry, folks!)
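For callers of the now-disabled wrapper, the replacement is a direct call to `ftfy.fix_text()`. A minimal sketch, assuming ftfy is installed separately since textacy no longer pulls it in; the helper name `fix_bad_unicode_compat()` is hypothetical and only shows where the call goes:

```python
try:
    import ftfy  # no longer a textacy dependency; install it yourself
except ImportError:
    ftfy = None


def fix_bad_unicode_compat(text):
    """Hypothetical stand-in for the old wrapper: delegate to ftfy directly."""
    if ftfy is None:
        raise ImportError("ftfy is required for this helper: pip install ftfy")
    return ftfy.fix_text(text)
```

In most code it is simpler to drop the helper and call `ftfy.fix_text()` at the call site.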
Move and rename textacy.text_utils.detect_language() => textacy.lang_utils.detect_lang(), where additional lang-related functionality can get added in the future
Add functionality to finish up recently implemented features
add textacy.spacier.doc_extensions.get_extensions() function to go with set_extensions() and remove_extensions(); it provides a slightly nicer interface over spaCy's current functionality.
add newer datasets (textacy.datasets.IMDB and textacy.datasets.Wikinews) into textacy's CLI so users can download and inspect them, too
add textacy.Corpus.word_counts() and textacy.Corpus.word_doc_counts(), which were punted on during the recent overhaul of the Corpus class (Note: The names have changed, from *_freqs() to *_counts().)
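To make the difference between the two new count methods concrete: a word count tallies every occurrence across the corpus, while a word doc count tallies how many documents contain the word at least once. The following is a plain-Python sketch over pre-tokenized documents; the function names are hypothetical, and it does not reproduce textacy's implementation or any of its options.

```python
from collections import Counter


def total_word_counts(docs):
    """Count every occurrence of each word across all documents."""
    counts = Counter()
    for words in docs:
        counts.update(words)
    return counts


def doc_word_counts(docs):
    """Count the number of documents in which each word appears."""
    counts = Counter()
    for words in docs:
        # A set collapses repeats, so each document contributes at most 1.
        counts.update(set(words))
    return counts


docs = [
    ["the", "cat", "sat"],
    ["the", "the", "dog"],
]
totals = total_word_counts(docs)
per_doc = doc_word_counts(docs)
```

Here "the" occurs three times in total but in only two documents, so the two tallies differ for that word and agree for the rest.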
Add and refactor many tests, for both new and old functionality, significantly increasing test coverage
Motivation and Context
This is some much-needed spring cleaning for textacy! Consistently following both internal and external naming conventions should reduce user confusion; improving the test suite means that errors are more likely to be caught; and better-factored functionality makes the code more maintainable.
How Has This Been Tested?
passes all the tests, and then some
Types of changes
[x] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[x] Breaking change (fix or feature that would cause existing functionality to change)
Checklist:
[x] My code follows the code style of this project.
[x] My change requires a change to the documentation, and I have updated it accordingly.