Hold-Krykke / PythonExam

4. Semester Python Eksamens Projekt

Preprocessing #8

Closed Runi-VN closed 4 years ago

Runi-VN commented 4 years ago

Preprocessing

Linked with #2

This pull request includes:

Things that'd be nice to have:

moved to #11

Details from during development

- [x] See tutorials ([DO](https://www.digitalocean.com/community/tutorials/how-to-perform-sentiment-analysis-in-python-3-using-the-natural-language-toolkit-nltk), [SA](https://stackabuse.com/python-for-nlp-sentiment-analysis-with-scikit-learn/), [YT](https://www.youtube.com/watch?v=xvqsFTUsOmc), [TDS](https://towardsdatascience.com/extracting-twitter-data-pre-processing-and-sentiment-analysis-using-python-3-0-7192bd8b47cf))
- [x] Get test data ~~Using the NLTK twitter_sample as provided in the DO guide~~ Using [actual webscraped data](https://github.com/Hold-Krykke/PythonExam/blob/8cebdf123e637dc8031508502c61ac74ac0ed6f2/tweets/trump_biden) from @HrBjarup

### Preprocess in this order:

- [x] Handle hashtags (#)
- [x] Handle invalid hashtags, such as those that may occur in a URL
- [x] [Summarize hashtags](https://stackoverflow.com/a/5829377) (or [this solution](https://stackoverflow.com/a/1692435))
- [x] Summarize mentions
- [x] Handle persons (@)
- [x] Handle URLs (How? Maybe webscraping can do something here? `http://` won't work, I think)
- ~~Handle internal links, such as images (twitter.com/pic.twitter.com)~~
- [x] Handle emojis. Add them to the end of the tweet tokens?
- [x] Remove emojis from text
- [x] Handle dates ([Notebook](https://github.com/datsoftlyngby/dat4sem2020spring-python/blob/master/notebooks/05-1%20DateTime.ipynb))
- [x] Handle full string
- [x] Handle newline characters [such as \n\n](https://github.com/Hold-Krykke/PythonExam/pull/8#issuecomment-623660068)
- [x] Remove stop words ([NLTK](https://www.geeksforgeeks.org/removing-stop-words-nltk-python/))
- [x] Remove punctuation, symbols, emojis (regex? **search for similar twitter cleanup cases**)
- [x] Normalization - [Lemmatisation](https://en.wikipedia.org/wiki/Lemmatisation) vs. [Stemming](https://en.wikipedia.org/wiki/Stemming)
- ~~Consider [removing if checks](https://stackoverflow.com/questions/57151705/should-i-check-if-a-substring-exists-before-trying-to-replace-it)~~ Decided against, as testing proved it worthwhile.
- [x] Documentation
- [x] Setup.py
- [x] Add file loading to app.py

### Possible resources:

https://pypi.org/project/tweet-preprocessor/

Initial proposed structure: We will most likely get a list structure with web-scraped content, including but not limited to the whole tweet, poster and date, and return a list with handled specifics and a clean String object.

**[Example](https://github.com/Hold-Krykke/PythonExam/pull/8/commits/43aab0b11c5907cbbdc5451b0948946eb1c3061d)** (shown as JSON but will probably be a `dict` with `list`s):

```json
//inbound
{
    "tweet": "This is the tweet of the year. #MyFirstTweet @folketinget www.runivn.dk",
    "author": "Runi Vedel",
    "date": "01/05/2020"
    //more...
}

//outbound - should maybe include original tweet?
{
    "tweet": "This tweet year",
    "hashtags": ["#MyFirstTweet"],
    "people": ["@folketinget"],
    "urls": ["runivn.dk"],
    "author": "Runi Vedel",
    "date": "01/05/2020"
}
```
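The proposed inbound/outbound structure could be sketched as a minimal, stdlib-only function. This is an illustration only: the regexes, the tiny stop-word set and the function name are all assumptions, and the real module would use the NLTK stop-word corpus plus the full step list above.

```python
import re

# Toy stop-word set for illustration; the actual code would load
# nltk.corpus.stopwords (assumption, see the checklist above).
STOP_WORDS = {"this", "is", "the", "of", "a", "an", "to", "and"}

def preprocess_tweet(tweet: dict) -> dict:
    """Hypothetical sketch: split a scraped tweet into clean text + entities."""
    text = tweet["tweet"]
    # Extract URLs first, so a '#' inside a link is not mistaken for a hashtag.
    urls = re.findall(r"(?:https?://|www\.)\S+", text)
    for url in urls:
        text = text.replace(url, " ")
    hashtags = re.findall(r"#\w+", text)
    mentions = re.findall(r"@\w+", text)
    text = re.sub(r"[#@]\w+", " ", text)      # drop the extracted entities
    words = re.findall(r"[A-Za-z]+", text)    # strip punctuation/symbols
    clean = " ".join(w for w in words if w.lower() not in STOP_WORDS)
    return {
        "tweet": clean,
        "hashtags": hashtags,
        "people": mentions,
        "urls": [u.removeprefix("www.") for u in urls],
        "author": tweet.get("author"),
        "date": tweet.get("date"),
    }
```

Running this on the inbound example yields the outbound shape shown above (hashtags, people and urls pulled out, stop words gone from the remaining text).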

MrHardcode commented 4 years ago

You've probably already thought about it, but there are a lot of newlines in the raw tweets, so completely removing all \n from the tweets would be a good idea, I think. Or maybe it's even better to replace all newlines with normal whitespace, so it's easier to split the string into separate words using space as the delimiter. Just a thought :brain:
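The replace-with-whitespace variant is a one-liner; collapsing any whitespace run (including \n\n) into a single space makes a plain `split` safe afterwards (sample string is made up):

```python
import re

raw = "Looking for more mutuals!! \n\nLess than six months until Nov 3, lets do this!\n"

# Collapse every run of whitespace (spaces, \n, \t) into one space,
# then trim, so str.split(" ") yields clean word tokens.
normalized = re.sub(r"\s+", " ", raw).strip()
tokens = normalized.split(" ")
```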

Runi-VN commented 4 years ago

Will update with the push tomorrow. I have added an option to either remove hashtags and mentions from the original tweet entirely, or to remove just their symbol and let the word stay. This is something we may need to discuss further, even though we already decided to remove them.
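The two modes could look roughly like this (hypothetical helper name and signature, just to make the option concrete):

```python
import re

def strip_entities(text: str, keep_words: bool = False) -> str:
    """Remove #hashtags/@mentions entirely, or keep the word and drop the symbol."""
    if keep_words:
        # Drop only the leading '#'/'@', keep the word itself.
        return re.sub(r"[#@](\w+)", r"\1", text)
    # Remove symbol and word together.
    return re.sub(r"[#@]\w+", "", text)
```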

Input:

{'raw_text': '  🚨🚨Good evening resisters!! 🌊🌊\n\nLooking for more mutuals!! \n\n🚨👉Follow/follow back party for the resistance!! 👈🚨\n\n🎉 🎊 🎉🤣🎉🎊\n\nLess than six months until Nov 3, lets do this! \n\n#Resist #Resistance #Trump #VoteBlueNoMatterWho #VoteBlue2020 #Biden\npic.twitter.com/HAyksU6LDk\n'}

Output:

{'raw_text': ['🚨🚨good', 'even', 'resister', 'looking', 'mutuals', '🚨👉followfollow', 'back', 'party', 'resistance', '👈🚨', '🎉🎊🎉🤣🎉🎊', 'less', 'six', 'month', 'nov', 'let', 'ance', 'pictwittercomhayksu6ldk'],
'hashtags': ['#Resist', '#Resistance', '#Trump', '#VoteBlueNoMatterWho', '#VoteBlue2020', '#Biden'], 'mentions': []}

(Some data is omitted)

Stemming seems a little aggressive (lol) and there's a bug with the punctuation removal, resulting in the link not being removed, but instead considered a word.

Spent a lot of time on minor stuff but getting some progress. Need to iron out the bugs and get rid of the damn emojis.
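The link-becomes-a-word bug is an ordering problem: if punctuation is stripped before URLs are handled, the dots and slashes vanish and the link fuses into one pseudo-word. A small sketch of the bug and one possible fix (the URL pattern here is an assumption, not the project's actual regex):

```python
import re

raw = "Less than six months until Nov 3, lets do this! pic.twitter.com/HAyksU6LDk"

# Bug: stripping punctuation first fuses the link into a single pseudo-word,
# which then survives tokenization as a "word".
no_punct_first = re.sub(r"[^\w\s]", "", raw)

# Fix: remove URLs (including bare pic.twitter.com links) *before* punctuation.
no_url_first = re.sub(r"\b(?:pic\.twitter\.com|https?://\S+|www\.\S+)\S*", "", raw)
cleaned = re.sub(r"[^\w\s]", "", no_url_first)
```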

Runi-VN commented 4 years ago

I didn't really have any test data with proper emojis, most were empty, but at least I got the following:

(raw text)

'🎡🏾#USA, presidential election #poll :\n\🏾🔼#Biden (D) : 50 % (+2)\n⏬⏬#Trump (R) :   41 % (-3)\n\n#MonmouthUniversity, 04/05/20 pic.twitter.com/L39hSBfg7m\n'

holds the following emojis ->

'emojis': ['', '', 'UP-POINTING SMALL RED TRIANGLE']

which cleaned up turns into

'tweet': ['usa', 'presidential', 'election', 'biden', 'trump', 'uppointing', 'small', 'red', 'triangle']
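Those emoji names look like the Unicode character names, which the stdlib can produce directly; a minimal sketch (the code-point cut-off and function name are assumptions, and an empty string falls out when a character has no name, matching the empty entries above):

```python
import unicodedata

def emoji_names(text: str) -> list[str]:
    """Return the Unicode names of symbol/emoji characters found in text."""
    names = []
    for ch in text:
        # Rough cut-off: arrows, symbols and emoji blocks start at U+2190.
        if ord(ch) >= 0x2190:
            names.append(unicodedata.name(ch, ""))  # "" when no name exists
    return names
```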
Runi-VN commented 4 years ago

Really good progress today. Cleaned up the code a bit too, and started initial work on setup.py as well as app.py.

Questions for tomorrow:

Do we want the following:

  • Summarize hashtags, mentions (where? preprocessing or presentation?)
  • The adjusted data is saved in a new folder (/training/hashtag)
  • Possibly write a test to check that the data has been cleaned correctly

Also:

Consider moving the aggressive A-Za-z regex further down the chain? After lemmatisation?

Castau commented 4 years ago

Looks very good. For me, the comments you've written are very helpful - both the "py-doc"? comments and the ones inside the functions! :)

I'm a big fan of the py-doc! And if others find the comments helpful, then just leave them in - I do as well; it was just a small concern about readability.