bdzyubak / torch-control

A top-level repo for evaluating natively available models
MIT License

Add the IMDB dataset to the sentiment analysis task data #17

Open bdzyubak opened 3 months ago

bdzyubak commented 3 months ago

The IMDB dataset is a popular collection of movie reviews, each paired with a positive/negative sentiment label. https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

As a first experiment, evaluate how well the models already trained on the original sentiment analysis dataset (https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data) predict IMDB sentiment. As a second experiment, pool the two datasets and train/validate on both. This will require quantizing the original dataset's 5-point labels to positive/negative, as I don't have the data labeling budget to reliably expand the IMDB labels to a 5-point scale. Neutral reviews may have to be dropped.
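A minimal sketch of the label quantization and pooling described above. It assumes the original dataset's labels are integers 0-4 with 2 as neutral and that IMDB labels are the strings "positive"/"negative"; the function names and the `source` tag are hypothetical and not part of the repo.

```python
from typing import Iterable, List, Optional, Tuple

NEUTRAL = 2  # assumption: 5-point scale is 0..4 with 2 meaning neutral


def binarize_sentiment(score: int) -> Optional[str]:
    """Map a 5-point score to 'positive'/'negative'; return None for neutral."""
    if not 0 <= score <= 4:
        raise ValueError(f"expected a score in 0..4, got {score}")
    if score == NEUTRAL:
        return None  # caller drops neutral rows
    return "positive" if score > NEUTRAL else "negative"


def pool_datasets(
    five_point_rows: Iterable[Tuple[str, int]],
    imdb_rows: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Return (text, binary_label, source) rows from both datasets."""
    pooled = []
    for text, score in five_point_rows:
        label = binarize_sentiment(score)
        if label is not None:  # neutral rows are dropped
            pooled.append((text, label, "original"))
    for text, label in imdb_rows:
        pooled.append((text, label, "imdb"))
    return pooled
```

Keeping the `source` tag on each row makes it possible to split validation metrics by dataset later, which is what the first experiment needs.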

I would expect the models trained on the original sentiment dataset to generalize poorly to IMDB due to the data augmentation in that dataset - see comment in the original issue: https://github.com/bdzyubak/torch-control/issues/14. Training on both datasets should improve generalizability, but the models may underfit due to the variability in training data labeling and the augmentation.
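One way to check the generalization hypothesis is to report accuracy separately per dataset instead of a single pooled number. The sketch below assumes predictions are available as (source, true_label, predicted_label) tuples; that record layout and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_source(
    records: Iterable[Tuple[str, str, str]],
) -> Dict[str, float]:
    """Compute accuracy for each source from (source, true, predicted) rows."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for source, true_label, predicted_label in records:
        total[source] += 1
        if true_label == predicted_label:
            correct[source] += 1
    return {source: correct[source] / total[source] for source in total}
```

A large gap between the "original" and "imdb" accuracies for a model trained only on the original dataset would support the expectation of poor generalization.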