castorini / hedwig

PyTorch deep learning models for document classification
Apache License 2.0
593 stars, 125 forks

Add support for torchtext datasets #79

Closed: mikhail-tsir closed this 2 years ago

mikhail-tsir commented 2 years ago

This PR adds support for the following datasets found in torchtext

I added dataset classes and adjusted the models' __main__ scripts to incorporate these datasets.

To download and process the datasets, refer to datasets/README.md. This step requires a separate virtual environment because it needs a different version of torchtext. The processed datasets are placed in .local_data/<dataset_name>.
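As a rough illustration of that layout, a script could resolve a dataset's directory along these lines. This is only a sketch: the helper name `dataset_dir`, the constant `LOCAL_DATA_ROOT`, and the name validation are hypothetical and not taken from this PR. Only the `.local_data/<dataset_name>` layout comes from the description above.

```python
from pathlib import Path

# Assumption: the root directory is the ".local_data" folder mentioned above,
# resolved relative to the current working directory.
LOCAL_DATA_ROOT = Path(".local_data")


def dataset_dir(dataset_name: str, root: Path = LOCAL_DATA_ROOT) -> Path:
    """Return the directory for `dataset_name` (hypothetical helper)."""
    # Reject empty names and names containing path separators so the result
    # always stays directly under `root`.
    if not dataset_name or "/" in dataset_name or "\\" in dataset_name:
        raise ValueError(f"invalid dataset name: {dataset_name!r}")
    return root / dataset_name
```

For example, `dataset_dir("example_dataset")` yields the path `.local_data/example_dataset`; the function does not check that the directory exists.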

Models supporting these datasets are:

This PR is really big, but most of the files are nearly identical. Maybe it should be refactored to follow the DRY principle?
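One possible shape for such a refactor is a small registry that the __main__ scripts could share instead of repeating per-dataset branches. This is a sketch under assumptions, not how hedwig is organized: `DATASET_REGISTRY`, `register_dataset`, `get_dataset`, and `ExampleDataset` are all hypothetical names introduced for illustration.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a dataset name to the class that builds it.
DATASET_REGISTRY: Dict[str, Callable[..., object]] = {}


def register_dataset(name: str):
    """Class decorator that records a dataset class under `name`."""
    def decorator(cls):
        if name in DATASET_REGISTRY:
            raise KeyError(f"dataset already registered: {name}")
        DATASET_REGISTRY[name] = cls
        return cls
    return decorator


def get_dataset(name: str, **kwargs):
    """Look up `name` in the registry and instantiate it with `kwargs`."""
    try:
        factory = DATASET_REGISTRY[name]
    except KeyError:
        known = sorted(DATASET_REGISTRY)
        raise ValueError(f"unknown dataset {name!r}; known: {known}") from None
    return factory(**kwargs)


@register_dataset("example")
class ExampleDataset:
    """Placeholder standing in for a real dataset class."""

    def __init__(self, path: str = ".local_data/example"):
        self.path = path
```

With something like this, each model's entry point could call `get_dataset(args.dataset)` once (assuming a `--dataset` argument), and adding a dataset would mean adding one decorated class rather than editing every script.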