Hold-Krykke / PythonExam

4. Semester Python Eksamens Projekt

Preprocessing #8

Closed Runi-VN closed 4 years ago

Runi-VN commented 4 years ago

Preprocessing

Linked with #2

This pull request includes:

Things that'd be nice to have:

moved to #11

Details from during development

- [x] See tutorials ([DO](https://www.digitalocean.com/community/tutorials/how-to-perform-sentiment-analysis-in-python-3-using-the-natural-language-toolkit-nltk), [SA](https://stackabuse.com/python-for-nlp-sentiment-analysis-with-scikit-learn/), [YT](https://www.youtube.com/watch?v=xvqsFTUsOmc), [TDS](https://towardsdatascience.com/extracting-twitter-data-pre-processing-and-sentiment-analysis-using-python-3-0-7192bd8b47cf))
- [x] Get test data ~~Using the NLTK twitter_sample as provided in the DO guide~~ Using [actual webscraped data](https://github.com/Hold-Krykke/PythonExam/blob/8cebdf123e637dc8031508502c61ac74ac0ed6f2/tweets/trump_biden) from @HrBjarup

### Preprocess in this order:

- [x] Handle hashtags (#)
- [x] Handle invalid hashtags, such as those that may occur in a URL
- [x] [Summarize hashtags](https://stackoverflow.com/a/5829377) (or [this solution](https://stackoverflow.com/a/1692435))
- [x] Summarize mentions
- [x] Handle persons (@)
- [x] Handle URLs (How? Maybe webscraping can do something here? `http://` won't work, I think)
- ~~Handle internal links, such as images (twitter.com/pic.twitter.com)~~
- [x] Handle emojis. Add them to the end of the tweet tokens?
- [x] Remove emojis from text
- [x] Handle dates ([Notebook](https://github.com/datsoftlyngby/dat4sem2020spring-python/blob/master/notebooks/05-1%20DateTime.ipynb))
- [x] Handle full string
- [x] Handle newline characters [such as \n\n](https://github.com/Hold-Krykke/PythonExam/pull/8#issuecomment-623660068)
- [x] Remove stop words ([NLTK](https://www.geeksforgeeks.org/removing-stop-words-nltk-python/))
- [x] Remove punctuation, symbols, emojis (regex? **search for similar twitter cleanup cases**)
- [x] Normalization - [Lemmatisation](https://en.wikipedia.org/wiki/Lemmatisation) vs. [Stemming](https://en.wikipedia.org/wiki/Stemming)
- ~~Consider [removing if checks](https://stackoverflow.com/questions/57151705/should-i-check-if-a-substring-exists-before-trying-to-replace-it)~~ Decided against, as testing proved it worthwhile.
- [x] Documentation
- [x] Setup.py
- [x] Add file loading to app.py

### Possible resources:

https://pypi.org/project/tweet-preprocessor/

Initial proposed structure: We will most likely get a list structure with web-scraped content, including but not limited to the whole tweet, poster and date, and return a list with handled specifics and a clean String object.

**[Example](https://github.com/Hold-Krykke/PythonExam/pull/8/commits/43aab0b11c5907cbbdc5451b0948946eb1c3061d)** (shown as JSON but will probably be a `dict` with `list`s):

```json
//inbound
{
    "tweet": "This is the tweet of the year. #MyFirstTweet @folketinget www.runivn.dk",
    "author": "Runi Vedel",
    "date": "01/05/2020"
    //more...
}

//outbound - should maybe include original tweet?
{
    "tweet": "This tweet year",
    "hashtags": ["#MyFirstTweet"],
    "people": ["@folketinget"],
    "urls": ["runivn.dk"],
    "author": "Runi Vedel",
    "date": "01/05/2020"
}
```
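The proposed inbound/outbound structure could be sketched as a minimal, stdlib-only function. This is an illustration only: the regexes, the tiny stop-word set and the function name are all assumptions, and the real module would use the NLTK stop-word corpus plus the full step list above.

```python
import re

# Toy stop-word set for illustration; the actual code would load
# nltk.corpus.stopwords (assumption, see the checklist above).
STOP_WORDS = {"this", "is", "the", "of", "a", "an", "to", "and"}

def preprocess_tweet(tweet: dict) -> dict:
    """Hypothetical sketch: split a scraped tweet into clean text + entities."""
    text = tweet["tweet"]
    # Extract URLs first, so a '#' inside a link is not mistaken for a hashtag.
    urls = re.findall(r"(?:https?://|www\.)\S+", text)
    for url in urls:
        text = text.replace(url, " ")
    hashtags = re.findall(r"#\w+", text)
    mentions = re.findall(r"@\w+", text)
    text = re.sub(r"[#@]\w+", " ", text)      # drop the extracted entities
    words = re.findall(r"[A-Za-z]+", text)    # strip punctuation/symbols
    clean = " ".join(w for w in words if w.lower() not in STOP_WORDS)
    return {
        "tweet": clean,
        "hashtags": hashtags,
        "people": mentions,
        "urls": [u.removeprefix("www.") for u in urls],
        "author": tweet.get("author"),
        "date": tweet.get("date"),
    }
```

Running this on the inbound example yields the outbound shape shown above (hashtags, people and urls pulled out, stop words gone from the remaining text).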

MrHardcode commented 4 years ago

You've probably already thought about it, but there are a lot of newlines in the raw tweets, so completely removing all \n from the tweets would be a good idea, I think. Or maybe it's even better to replace all newlines with normal whitespace, so it's easier to split the string into separate words using space as the delimiter. Just a thought :brain:
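The replace-with-whitespace variant is a one-liner; collapsing any whitespace run (including \n\n) into a single space makes a plain `split` safe afterwards (sample string is made up):

```python
import re

raw = "Looking for more mutuals!! \n\nLess than six months until Nov 3, lets do this!\n"

# Collapse every run of whitespace (spaces, \n, \t) into one space,
# then trim, so str.split(" ") yields clean word tokens.
normalized = re.sub(r"\s+", " ", raw).strip()
tokens = normalized.split(" ")
```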

Runi-VN commented 4 years ago

Will update with the push tomorrow. I have added an option to either remove hashtags and mentions from the original tweet entirely, or to remove just their symbol and let the word stay. This is something we may need to discuss further, even though we already decided to remove them.
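The two modes could look roughly like this (hypothetical helper name and signature, just to make the option concrete):

```python
import re

def strip_entities(text: str, keep_words: bool = False) -> str:
    """Remove #hashtags/@mentions entirely, or keep the word and drop the symbol."""
    if keep_words:
        # Drop only the leading '#'/'@', keep the word itself.
        return re.sub(r"[#@](\w+)", r"\1", text)
    # Remove symbol and word together.
    return re.sub(r"[#@]\w+", "", text)
```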

Input:

{'raw_text': '  🚨🚨Good evening resisters!! 🌊🌊\n\nLooking for more mutuals!! \n\n🚨👉Follow/follow back party for the resistance!! 👈🚨\n\n🎉 🎊 🎉🤣🎉🎊\n\nLess than six months until Nov 3, lets do this! \n\n#Resist #Resistance #Trump #VoteBlueNoMatterWho #VoteBlue2020 #Biden\npic.twitter.com/HAyksU6LDk\n'}

Output:

{'raw_text': ['🚨🚨good', 'even', 'resister', 'looking', 'mutuals', '🚨👉followfollow', 'back', 'party', 'resistance', '👈🚨', '🎉🎊🎉🤣🎉🎊', 'less', 'six', 'month', 'nov', 'let', 'ance', 'pictwittercomhayksu6ldk'],
'hashtags': ['#Resist', '#Resistance', '#Trump', '#VoteBlueNoMatterWho', '#VoteBlue2020', '#Biden'], 'mentions': []}

(Some data is omitted)

Stemming seems a little aggressive (lol) and there's a bug with the punctuation removal, resulting in the link not being removed, but instead considered a word.

Spent a lot of time on minor stuff but getting some progress. Need to iron out the bugs and get rid of the damn emojis.
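The link-becomes-a-word bug is an ordering problem: if punctuation is stripped before URLs are handled, the dots and slashes vanish and the link fuses into one pseudo-word. A small sketch of the bug and one possible fix (the URL pattern here is an assumption, not the project's actual regex):

```python
import re

raw = "Less than six months until Nov 3, lets do this! pic.twitter.com/HAyksU6LDk"

# Bug: stripping punctuation first fuses the link into a single pseudo-word,
# which then survives tokenization as a "word".
no_punct_first = re.sub(r"[^\w\s]", "", raw)

# Fix: remove URLs (including bare pic.twitter.com links) *before* punctuation.
no_url_first = re.sub(r"\b(?:pic\.twitter\.com|https?://\S+|www\.\S+)\S*", "", raw)
cleaned = re.sub(r"[^\w\s]", "", no_url_first)
```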

Runi-VN commented 4 years ago

I didn't really have any test data with proper emojis, most were empty, but at least I got the following:

(raw text)

'🎡🏾#USA, presidential election #poll :\n\🏾🔼#Biden (D) : 50 % (+2)\n⏬⏬#Trump (R) :   41 % (-3)\n\n#MonmouthUniversity, 04/05/20 pic.twitter.com/L39hSBfg7m\n'

holds the following emojis ->

'emojis': ['', '', 'UP-POINTING SMALL RED TRIANGLE']

which cleaned up turns into

'tweet': ['usa', 'presidential', 'election', 'biden', 'trump', 'uppointing', 'small', 'red', 'triangle']
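Those emoji names look like the Unicode character names, which the stdlib can produce directly; a minimal sketch (the code-point cut-off and function name are assumptions, and an empty string falls out when a character has no name, matching the empty entries above):

```python
import unicodedata

def emoji_names(text: str) -> list[str]:
    """Return the Unicode names of symbol/emoji characters found in text."""
    names = []
    for ch in text:
        # Rough cut-off: arrows, symbols and emoji blocks start at U+2190.
        if ord(ch) >= 0x2190:
            names.append(unicodedata.name(ch, ""))  # "" when no name exists
    return names
```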
Runi-VN commented 4 years ago

Really good progress today. Cleaned up the code a bit too, and started initial work on setup.py as well as app.py.

Questions for tomorrow:

Do we want the following:

  • Summarize hashtags, mentions (where? preprocessing or presentation?)
  • The adjusted data is saved in a new folder (/training/hashtag)
  • Possibly write a test to check that the data has been cleaned correctly

Also:

Consider moving the aggressive A-Za-z regex further down the chain? After lemmatisation?

Castau commented 4 years ago

Looks very good. For me, the comments you've written are very helpful - both the "py-doc"? comments and the ones inside the functions! :)

I'm a big fan of the py-doc! And if others find the comments helpful, then just leave them in - I do as well; it was just a small concern about readability.