chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

Add UDHR dataset #271

Closed bdewilde closed 5 years ago

bdewilde commented 5 years ago

Description

Motivation and Context

I wanted a relatively lightweight and comprehensive set of source documents for getting language-specific distributions of characters, for use in certain data augmentation transform functions. The current solution relies on ConceptNet, which is 1. very large and 2. less comprehensive, thus 3. unsatisfactory.
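As a rough illustration of what "language-specific distributions of characters" could look like in practice, here is a minimal sketch using only the standard library. The helper name `char_distribution` and the two sample sentences are assumptions made for this example, not part of the PR; in real use the texts would come from the dataset added here rather than from a hard-coded dict.

```python
from collections import Counter


def char_distribution(texts):
    """Map each non-whitespace character to its relative frequency.

    Hypothetical helper for illustration only; not textacy's API.
    """
    counts = Counter(
        char
        for text in texts
        for char in text.lower()
        if not char.isspace()
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders characters from most to least frequent
    return {char: n / total for char, n in counts.most_common()}


# Stand-in sample texts keyed by language code (assumed for this sketch)
sample_texts_by_lang = {
    "en": ["All human beings are born free and equal in dignity and rights."],
    "es": ["Todos los seres humanos nacen libres e iguales en dignidad y derechos."],
}

distributions = {
    lang: char_distribution(texts)
    for lang, texts in sample_texts_by_lang.items()
}
```

Each resulting distribution sums to 1, so it could be passed as weights to something like `random.choices` when a transform needs to sample characters that are plausible for a given language.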

How Has This Been Tested?

Wrote the usual tests, and they all pass. Fixed some other datasets' buggy tests in the process!

Types of changes

Checklist: