Closed by bdzyubak 3 months ago
DistilBERT was fine-tuned to 0.80 training accuracy and 0.68 validation accuracy. Peak validation accuracy was reached after one epoch, after which validation performance deteriorated severely as the model overfit the training data and drifted away from its generalized pre-trained weights. Because of checkpointing, the best model from the first epoch was saved, so the later epochs are not a concern.
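The keep-best-checkpoint behavior can be sketched in plain Python (in a real transformers run this is what e.g. `Trainer`'s `load_best_model_at_end` handles). The `train_epoch` and `evaluate` callables and the toy accuracy sequence below are hypothetical stand-ins, not the actual training code:

```python
import copy

def train_with_best_checkpoint(model_state, epochs, train_epoch, evaluate):
    """Keep the checkpoint with the highest validation accuracy.

    `train_epoch` and `evaluate` are hypothetical callables standing in
    for one epoch of fine-tuning and a validation pass, respectively.
    """
    best_acc, best_state = float("-inf"), None
    for _ in range(epochs):
        model_state = train_epoch(model_state)
        val_acc = evaluate(model_state)
        if val_acc > best_acc:
            # Snapshot the best-so-far weights instead of the final ones.
            best_acc, best_state = val_acc, copy.deepcopy(model_state)
    return best_state, best_acc

# Toy run: validation accuracy peaks at epoch 0 and then degrades,
# mirroring the overfitting behavior described above.
accs = iter([0.68, 0.61, 0.55])
state, acc = train_with_best_checkpoint(
    {"epoch": -1}, 3,
    train_epoch=lambda s: {"epoch": s["epoch"] + 1},
    evaluate=lambda s: next(accs),
)
# state is the epoch-0 snapshot, acc is 0.68
```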
Validation performance is relatively poor and much lower than training performance. A larger model might fit the training data better than 80%, addressing the underfitting; on the other hand, I expect even training accuracy to be capped by the labeling issue described below. The overfitting may be addressed by freezing most of the network layers and training only the classification head.
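The freezing idea can be sketched with the standard PyTorch `requires_grad` pattern. To keep the sketch self-contained, a tiny module stands in for the pre-trained model; with the real `DistilBertForSequenceClassification` the same loop would freeze `model.distilbert.parameters()` instead:

```python
import torch
import torch.nn as nn

# Stand-in for DistilBertForSequenceClassification: a "base" playing the
# role of the pre-trained transformer, plus a classification head.
class TinySentimentModel(nn.Module):
    def __init__(self, hidden=16, num_labels=2):
        super().__init__()
        self.base = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, x):
        return self.classifier(self.base(x))

model = TinySentimentModel()

# Freeze everything except the classification head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("classifier")

# The optimizer only sees the head's parameters, so the pre-trained
# weights cannot be degraded by further training.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=5e-5)

frozen_count = sum(p.numel() for p in model.parameters() if not p.requires_grad)
head_count = sum(p.numel() for p in trainable)
```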
The dataset augments labels by heavily resampling each review into smaller chunks, down to a single word. Each chunk inherits the label of its parent review, so the target sentiment for "A", "A series", "occasionally amuses", and "none of which amounts to much of a story" all map to the label of the full combined review. Without more intelligent splitting, this may cap the network's ability to learn sentiment, since datapoints like "A"/"A series" will carry variable, essentially arbitrary labels.
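One simple mitigation would be to drop the shortest chunks, whose inherited labels carry little signal. This is a minimal sketch; the filter function and the 3-token threshold are assumptions for illustration, not part of the current pipeline:

```python
def filter_short_chunks(phrases, min_tokens=3):
    """Drop resampled chunks shorter than `min_tokens` whitespace tokens.

    Fragments like "A" or "A series" inherit the full review's label but
    carry little sentiment signal of their own, so they are filtered out.
    The threshold of 3 is a hypothetical starting point, not a tuned value.
    """
    return [(text, label) for text, label in phrases
            if len(text.split()) >= min_tokens]

# Chunks from the example above, all inheriting the parent review's label.
data = [
    ("A", 1),
    ("A series", 1),
    ("occasionally amuses", 1),
    ("none of which amounts to much of a story", 1),
]
kept = filter_short_chunks(data)
# → only the chunk long enough to plausibly carry sentiment survives
```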
Next steps:
1. The IMDB dataset is also interesting for sentiment review. Potentially, implement it as a separate experiment and then cross-validate by training on one or both.
2. Implement the other common networks and compare performance. For those that come with out-of-the-box sentiment analysis, evaluate performance without fine-tuning.
3. Compare fine-tuning with frozen layers, where only the sentiment head is allowed to train.
Fine-tune DistilBERT for movie review sentiment analysis on the following dataset: https://www.kaggle.com/competitions/sentiment-analysis-on-movie-reviews/data
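Loading the competition data could look roughly like the sketch below. The competition distributes a tab-separated train file; the column names follow the Kaggle data description, and the inline rows here are illustrative stand-ins (not taken from the real file) so the sketch stays self-contained:

```python
import io
import pandas as pd

# Inline stand-in for the competition's tab-separated training file.
# Columns assumed per the Kaggle data page; rows are made up.
sample = (
    "PhraseId\tSentenceId\tPhrase\tSentiment\n"
    "1\t1\tA series of escapades\t2\n"
    "2\t1\tA series\t2\n"
)

df = pd.read_csv(io.StringIO(sample), sep="\t")

# Sentiment is a 5-class label (0 = negative ... 4 = positive),
# which becomes the fine-tuning target.
texts = df["Phrase"].tolist()
labels = df["Sentiment"].tolist()
```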