Dataset.records() now streams (text, metadata) pairs rather than a single dict containing both text and metadata, so users don't need to know field names to split them up; this also makes it easier to generate Doc and Corpus from the data.
Filtering and limiting the number of texts/records produced is clearer and more consistent between the .texts() and .records() methods on a given Dataset, not to mention more efficient!
Downloading datasets now always shows progress bars, produces the same file names, and some (smaller) datasets are also automatically extracted from archive files for easy inspection.
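For reviewers who want a picture of the new records shape, here is a minimal sketch. ToyDataset and its fields ("speaker", "year") are hypothetical stand-ins made up for illustration, not the actual classes in this PR; the only thing it is meant to show is that .records() yields (text, metadata) pairs, so callers can unpack them without knowing which field holds the text.

```python
from itertools import islice


class ToyDataset:
    """Hypothetical stand-in for a Dataset; not the project's real class."""

    def __init__(self, rows):
        self._rows = rows

    def __iter__(self):
        # Iterate over the "raw" records, each a plain dict.
        yield from self._rows

    def texts(self, limit=None):
        # Stream only the text of each record.
        for row in islice(self, limit):
            yield row["text"]

    def records(self, limit=None):
        # Stream (text, metadata) pairs instead of one combined dict.
        for row in islice(self, limit):
            meta = dict(row)  # copy so the raw record is left untouched
            text = meta.pop("text")
            yield text, meta


ds = ToyDataset([
    {"text": "First speech.", "speaker": "A", "year": 2001},
    {"text": "Second speech.", "speaker": "B", "year": 2005},
])

# Callers unpack the pair directly; no field names needed to find the text.
for text, meta in ds.records(limit=1):
    print(text, meta)
```

Because each item is already a (text, metadata) pair, it can be passed straight to whatever builds a Doc or Corpus, assuming that constructor accepts text and metadata separately.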
standardize and consolidate implementation under the hood
every Dataset now has an __iter__ method for iterating over the "raw" data, and a function for stacking and applying filters based on user inputs, which are used by both .texts() and .records()
move general Dataset functionality into a separate datasets.utils module
add more, and more thorough, tests :)
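The "stacking and applying filters" idea from the list above can be sketched roughly as follows. The names build_filters and filtered_iter, and the min_len / year_range parameters, are assumptions chosen for illustration; the real function names and filter options in this PR may differ.

```python
from itertools import islice


def build_filters(min_len=None, year_range=None):
    """Hypothetical helper: turn user inputs into a list of predicates."""
    filters = []
    if min_len is not None:
        filters.append(lambda rec: len(rec["text"]) >= min_len)
    if year_range is not None:
        lo, hi = year_range
        filters.append(lambda rec: lo <= rec["year"] <= hi)
    return filters


def filtered_iter(raw, filters, limit=None):
    """Lazily yield raw records that pass every filter, up to `limit`."""
    matching = (rec for rec in raw if all(f(rec) for f in filters))
    # islice stops pulling from the source once `limit` matches are found.
    return islice(matching, limit)


raw = [
    {"text": "Short.", "year": 1999},
    {"text": "A somewhat longer text.", "year": 2003},
    {"text": "Another long enough text.", "year": 2010},
]

filters = build_filters(min_len=10, year_range=(2000, 2005))
selected = list(filtered_iter(raw, filters))
print(selected)
```

Having both .texts() and .records() go through one shared filtered iterator is what keeps their filtering and limiting behavior consistent, and the lazy generator avoids reading more raw records than the limit requires.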
Motivation and Context
Lots of weirdness and inconsistency within and among the datasets, and a general dissatisfaction with the data structure for records.
How Has This Been Tested?
All tests pass! Even the new ones.
Types of changes
[x] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[x] Breaking change (fix or feature that would cause existing functionality to change)
Checklist:
[ ] My code follows the code style of this project.
[ ] My change requires a change to the documentation, and I have updated it accordingly.