chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

Extend and refactor extraction and doc representations functionality #329

Closed by bdewilde 3 years ago

bdewilde commented 3 years ago

Description

Motivation and Context

Despite best efforts, certain corners of textacy have gathered some nasty cobwebs. Functionality for representing documents as networks was particularly gross, suffering from code duplication in two different modules and awkward/unclear intentions behind some functions. The vsm subpackage was also long-neglected — it didn't even have type annotations! — and suffered from clunky, error-prone weighting configuration. Lastly, the spacier Doc extensions were in a weird, frozen place, implemented while spaCy was still expanding upon its extension / customization functionality and while I was still getting used to it.

How Has This Been Tested?

Many tests were added and improved; all pass.
