Using TorchText to learn from bag-of-words representations was a bottleneck in Hedwig. TorchText is optimized for plain text and word embeddings, not tf-idf values. It doesn't support compressed input files (so we had to store tf-idf values as plain text, leading to files larger than 100 GB), and it is memory-inefficient for this use case.
This pull request introduces new trainer, evaluator, and data loader classes specifically for handling tf-idf representations as sparse matrices. This lets us support much larger vocabularies on larger datasets such as IMDB and Yelp.
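To illustrate the idea, here is a minimal sketch of a data loader built around a sparse tf-idf matrix. The class name `TfidfDataset` and the toy data are hypothetical, not the code in this PR; the point is that the full matrix stays in SciPy CSR form in memory, and only individual rows are densified as batches are assembled:

```python
import numpy as np
import scipy.sparse as sp
import torch
from torch.utils.data import Dataset, DataLoader

class TfidfDataset(Dataset):
    """Serves tf-idf rows from a scipy CSR matrix (hypothetical sketch)."""

    def __init__(self, tfidf_matrix, labels):
        self.matrix = tfidf_matrix  # scipy.sparse CSR, shape (n_docs, vocab_size)
        self.labels = labels

    def __len__(self):
        return self.matrix.shape[0]

    def __getitem__(self, idx):
        # Densify one row at a time; the full matrix stays sparse in memory
        row = torch.from_numpy(self.matrix[idx].toarray().squeeze(0)).float()
        return row, self.labels[idx]

# Toy example: 3 documents over a vocabulary of 5 terms
X = sp.random(3, 5, density=0.4, format="csr", random_state=0)
y = np.array([0, 1, 0])
loader = DataLoader(TfidfDataset(X, y), batch_size=2)
for batch, labels in loader:
    print(batch.shape)  # dense per-batch tensors, e.g. torch.Size([2, 5])
```

Because the densification happens per row rather than up front, the memory footprint scales with batch size times vocabulary size instead of corpus size times vocabulary size, which is what makes large-vocabulary datasets tractable.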