castorini / hedwig

PyTorch deep learning models for document classification
Apache License 2.0

Add native bag of words support for MLP #43

Closed achyudh closed 4 years ago

achyudh commented 4 years ago

Using TorchText to learn from bag-of-words representations was a bottleneck in Hedwig. TorchText is optimized for plain text and word embeddings, not tf-idf values. It doesn't support compressed input files, so we have to store tf-idf values as plain text, which leads to files larger than 100 GB, and it doesn't seem to be memory-efficient for this use case.

This pull request introduces new trainer, evaluator, and data loader classes specifically for handling tf-idf representations as sparse matrices. This should allow us to use much larger vocabulary sizes on larger datasets such as IMDB and Yelp.
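
To illustrate the general idea (not the code in this pull request), here is a minimal, dependency-free sketch of keeping tf-idf values in a compressed sparse row (CSR) layout and densifying only one mini-batch at a time. The function names `tfidf_csr` and `iter_dense_batches` are hypothetical, and the tf-idf weighting (term frequency times `log(N / df) + 1`) is an assumed variant; a real implementation would more likely build on `scipy.sparse` and PyTorch tensors.

```python
import math
from collections import Counter


def tfidf_csr(docs):
    """Build a tf-idf matrix in CSR form from tokenized documents.

    Returns (data, indices, indptr, vocab). Only nonzero entries are stored,
    so memory grows with the number of distinct terms per document rather
    than with the vocabulary size.
    """
    # Assign each token a column index in order of first appearance.
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))

    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for doc in docs for tok in set(doc))

    data, indices, indptr = [], [], [0]
    for doc in docs:
        counts = Counter(doc)
        for tok in sorted(counts, key=vocab.get):
            # Assumed weighting: the +1 keeps terms shared by all docs nonzero.
            idf = math.log(n_docs / df[tok]) + 1.0
            data.append(counts[tok] / len(doc) * idf)
            indices.append(vocab[tok])
        indptr.append(len(data))  # marks where this row's entries end
    return data, indices, indptr, vocab


def iter_dense_batches(data, indices, indptr, n_features, batch_size):
    """Yield dense mini-batches (lists of rows) from CSR arrays.

    Only batch_size rows are ever materialized densely at once.
    """
    n_rows = len(indptr) - 1
    for start in range(0, n_rows, batch_size):
        stop = min(start + batch_size, n_rows)
        batch = [[0.0] * n_features for _ in range(stop - start)]
        for row in range(start, stop):
            for k in range(indptr[row], indptr[row + 1]):
                batch[row - start][indices[k]] = data[k]
        yield batch
```

With this layout, the full matrix is never stored densely: each batch could be converted to a tensor just before the forward pass, which is the property that makes larger vocabularies feasible.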