Add native bag of words support for MLP

Using TorchText to learn from bag-of-words representations was a bottleneck in Hedwig. TorchText is optimized for plain text and word embeddings, not tf-idf values. It doesn't support compressed input files, so we had to store tf-idf values as plain text, which led to files larger than 100 GB. It also doesn't seem to be memory-efficient for this use case.

This pull request introduces new trainer, evaluator, and data loader classes specifically for handling tf-idf representations as sparse matrices. This allows us to use much larger vocabulary sizes on larger datasets such as IMDB and Yelp.
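For reviewers unfamiliar with the approach, here is a minimal sketch of the general idea, not the code in this PR. It assumes the tf-idf features are held in a SciPy CSR matrix, stored in compressed `.npz` form, and densified one mini-batch at a time. The helper name `iter_batches` and the toy values are hypothetical and only serve the illustration.

```python
import io

import numpy as np
from scipy import sparse


def iter_batches(features, labels, batch_size):
    """Yield (dense_batch, label_batch) pairs from a CSR matrix.

    Only the rows of the current batch are densified, so peak memory
    scales with batch_size * vocab_size rather than with the full dataset.
    """
    num_rows = features.shape[0]
    for start in range(0, num_rows, batch_size):
        stop = min(start + batch_size, num_rows)
        dense = features[start:stop].toarray().astype(np.float32)
        yield dense, labels[start:stop]


# Toy tf-idf matrix: 4 documents, vocabulary of 6 terms (made-up values).
rows = np.array([0, 0, 1, 2, 3, 3])
cols = np.array([0, 3, 1, 5, 2, 4])
vals = np.array([0.5, 0.8, 1.2, 0.3, 0.9, 0.4])
tfidf = sparse.csr_matrix((vals, (rows, cols)), shape=(4, 6))
labels = np.array([0, 1, 1, 0])

# Round-trip through a compressed .npz buffer (a file path works the same way).
buffer = io.BytesIO()
sparse.save_npz(buffer, tfidf, compressed=True)
buffer.seek(0)
loaded = sparse.load_npz(buffer)

# Each dense batch could then be handed to the model, e.g. via torch.from_numpy.
batches = list(iter_batches(loaded, labels, batch_size=3))
```

Storing only the non-zero entries is what keeps the on-disk and in-memory footprint small: a tf-idf row typically has far fewer non-zero terms than the vocabulary size.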

castorini / hedwig

Add native bag of words support for MLP #43