TheRensselaerIDEA / twitter-nlp

Data Analytics on Twitter with Natural Language Processing
MIT License
17 stars, 7 forks

Tweet Sentiment Tokenization Pipeline #46

Closed by shwehtom89 3 years ago

shwehtom89 commented 3 years ago

This PR adds the first half of the BERT sentiment classification pipeline: data preparation and preprocessing. It does the following:

  1. Downloads the SemEval dataset from Dropbox
  2. Reads all data files and parses the tweets into a pandas DataFrame
  3. Combines all the data and splits it into training, validation, and test sets
  4. Tokenizes the tweet text using the BERT tokenizer
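
For illustration only, here is a minimal, dependency-free sketch of what steps 3 and 4 amount to. It is not the code in this PR: the PR uses pandas and the BERT tokenizer, whereas this sketch uses plain Python lists and a toy vocabulary so that it runs anywhere. The names `split_dataset`, `wordpiece`, `tokenize`, and `TOY_VOCAB`, as well as the 80/10/10 split ratio, the seed, and `max_len`, are all assumptions made for the example. The real BERT tokenizer also handles punctuation splitting and text normalization, which are omitted here.

```python
import random

# Toy vocabulary (an assumption for this sketch); real BERT vocabularies
# contain roughly 30,000 entries.
TOY_VOCAB = {"[PAD]", "[UNK]", "[CLS]", "[SEP]", "great", "game", "play", "##ing"}


def split_dataset(rows, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle rows and split them into train/validation/test lists.

    The fractions and seed are illustrative, not the values used in the PR.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, val, test


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab, max_len=16):
    """Lowercase, split on whitespace, apply WordPiece, add special tokens, pad."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens = tokens[:max_len - 1] + ["[SEP]"]  # truncate, keeping room for [SEP]
    n_pad = max_len - len(tokens)
    attention_mask = [1] * len(tokens) + [0] * n_pad
    return tokens + ["[PAD]"] * n_pad, attention_mask


if __name__ == "__main__":
    rows = [(f"tweet {i}", "neutral") for i in range(10)]
    train, val, test = split_dataset(rows)
    print(len(train), len(val), len(test))
    print(tokenize("Great game playing", TOY_VOCAB, max_len=8))
```

In practice, a library tokenizer would replace `wordpiece` and `tokenize`, and would map the tokens to integer IDs. The sketch only shows the shape of the output: a fixed-length token sequence plus an attention mask that marks the padding.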

Fine-tuning is coming up next.