Closed by RuiFilipeCampos 5 months ago
The current dataset has 50k entries. Candidate larger datasets:
https://www.kaggle.com/datasets/kazanova/sentiment140
https://www.kaggle.com/datasets/bittlingmayer/amazonreviews/data
I suspect that using a pre-trained tokenizer and a precomputed (fixed, non-learned) positional encoding would help a lot.
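For reference, a minimal sketch of what a pre-calculated positional encoding could look like. The comment above does not say which scheme is meant, so this assumes the fixed sinusoidal encoding from the original Transformer paper. The function name, `max_len`, and `d_model` are illustrative and not taken from this repo.

```python
import math


def sinusoidal_positional_encoding(max_len: int, d_model: int) -> list[list[float]]:
    """Build a max_len x d_model table of fixed sinusoidal position encodings.

    Assumption: the sin/cos scheme from "Attention Is All You Need";
    the issue itself does not name a specific encoding.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    table = []
    for pos in range(max_len):
        row = [0.0] * d_model
        # Each (even, odd) pair of dimensions shares one frequency.
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row[i] = math.sin(angle)      # even dimension
            row[i + 1] = math.cos(angle)  # odd dimension
        table.append(row)
    return table


# Computed once up front; nothing here is trained.
PE = sinusoidal_positional_encoding(max_len=128, d_model=16)
```

The table would be added to the token embeddings and kept frozen, so the model no longer has to learn positions from only 50k examples.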
A simple feed-forward network on top of the learnable tokenizer is already enough to overfit.
https://github.com/Digital-Defiance/llm-voice-chat/releases/tag/dataset-release-amazon-reviews