Description
remove the textacy.Doc class, and split its essential functionality into two parts:
make_spacy_doc(): a convenient and flexible function for making spaCy Docs from text or (text, metadata) pairs, with an optional specification of the spaCy language pipeline to use
a variety of custom doc property and method extensions added directly to the global spacy.tokens.Doc class, accessible via its Doc._ "underscore" property, plus functions for adding/removing the extensions as desired
re-implement the textacy.Corpus class as a collection of spaCy Docs rather than textacy.Docs, with the following improvements:
add documents through a much simpler API, and in the case of many documents, process them using a faster path through the language pipeline
save/load data to disk using a more efficient and robust process
update, add, and improve many tests
significantly improve test coverage of Corpus- and Doc-related functionality, and add a more detailed coverage report to CI builds
significantly reduce the overall run-time of the full test suite
significantly reduce the run-time required for the initial import of textacy by lazy-loading an expensive constant and hiding a couple of heavy imports
update and tidy up documentation throughout the code base, particularly as it relates to Corpus and Doc functionality
bump the minimum required spaCy version: v2.0.0 => v2.0.12
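For reviewers unfamiliar with the import-time techniques mentioned in the list above, here is a minimal sketch of both ideas. It is not the actual textacy code: the names get_expensive_constant and heavy_feature are hypothetical, and the standard-library statistics module merely stands in for a heavy dependency.

```python
import functools


@functools.lru_cache(maxsize=None)
def get_expensive_constant():
    """Build a (hypothetical) expensive constant on first use, then reuse it."""
    # Stand-in for costly work, such as loading a large resource from disk.
    # Nothing here runs at import time; it runs on the first call only.
    return frozenset(str(i) for i in range(100000))


def heavy_feature(values):
    """Use a (stand-in) heavy dependency without importing it at module load."""
    # The import is "hidden" inside the function, so the cost is paid only
    # by callers that actually need this feature.
    import statistics

    return statistics.mean(values)


first = get_expensive_constant()
second = get_expensive_constant()  # served from the cache, not rebuilt
```

The trade-off is that the first call to either function is slower, while importing the package stays cheap for users who never touch them.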
Motivation and Context
Now that spaCy's core objects are customizable, it makes a lot of sense to hook functionality in "natively" rather than maintaining external wrappers and work-arounds.
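As a rough illustration of what hooking functionality in "natively" looks like, the sketch below registers and later removes one custom extension on spaCy's Doc class. The extension name n_tokens and the helper functions are hypothetical and are not the names used in this PR; only spacy.blank() and the Doc.set_extension, Doc.has_extension, and Doc.remove_extension class methods are real spaCy API.

```python
import spacy
from spacy.tokens import Doc


def n_tokens(doc):
    """Hypothetical extension: the number of tokens in the doc."""
    return len(doc)


def add_extensions():
    """Register the custom extension on the global Doc class, if missing."""
    if not Doc.has_extension("n_tokens"):
        Doc.set_extension("n_tokens", getter=n_tokens)


def remove_extensions():
    """Unregister the custom extension, if present."""
    # Doc.remove_extension is available in spaCy v2.0.12 and later.
    if Doc.has_extension("n_tokens"):
        Doc.remove_extension("n_tokens")


add_extensions()
nlp = spacy.blank("en")  # tokenizer-only pipeline; no model download needed
doc = nlp("Hello world")
count = doc._.n_tokens  # accessed through the "underscore" property
remove_extensions()
```

Because extensions are registered on the global Doc class, every Doc in the process sees them, which is why functions for both adding and removing them are useful.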
How Has This Been Tested?
Added and validated many tests. Code coverage for the new functionality is around 95%!
Types of changes
[x] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[x] Breaking change (fix or feature that would cause existing functionality to change)
Checklist:
[x] My code follows the code style of this project.
[x] My change requires a change to the documentation, and I have updated it accordingly.