Improve sentiment module training

Rostlab / JS16_ProjectD_Group4

Joffrey Baratheon is one of the most loathed characters in TV history. As a matter of fact, people were celebrating his TV death on Twitter. We are interested in learning more about how people feel about different characters by analyzing tweets that mention GoT characters. In this project you will analyze Twitter feeds across a timeline: you will look for the names of GoT characters in each feed and try to identify whether a tweet is positive or negative. You can then generate a metric that evaluates the accumulated sentiment expressed on Twitter for a given character at a given point in time, and the trend (positive or negative). It will be interesting to intersect the sentiments for characters following the airing of a certain episode (you can easily get the airing date for an episode from the database constructed in Project A).

GNU General Public License v3.0


Improve sentiment module training #42

Closed julienschmidt closed 8 years ago

julienschmidt commented 8 years ago

The percentage of tweets with a usable Sentiment score seems rather low. Someone should analyze some of the tweets with score=0 and look for words which we can manually add to the wordlist.

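var retext = require('retext');
var sentiment = require('retext-sentiment');

// Custom word-to-valence mappings are passed as plugin options.
retext().use(sentiment, {
    'cat': -3,
    'dog': 3
});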
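
The analysis suggested here could start from a frequency count of words that appear in zero-score tweets but are missing from the wordlist. A minimal sketch, assuming tweets are plain objects with hypothetical `text` and `score` fields and that the wordlist is an object mapping words to valences; neither shape is confirmed by the project:

```javascript
// Hypothetical helper: counts words that occur in zero-score tweets
// but are absent from the sentiment wordlist.
function unknownWordCounts(tweets, wordlist) {
  const counts = new Map();
  for (const tweet of tweets) {
    if (tweet.score !== 0) continue; // only unscored tweets are of interest
    // Lowercase, then split on anything that is not a letter or apostrophe.
    const words = tweet.text.toLowerCase().split(/[^a-z']+/).filter(Boolean);
    for (const word of words) {
      if (Object.prototype.hasOwnProperty.call(wordlist, word)) continue;
      counts.set(word, (counts.get(word) || 0) + 1);
    }
  }
  // Most frequent candidates first.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Made-up sample data, for illustration only.
const sampleTweets = [
  { text: 'Tyrion is the coolest', score: 0 },
  { text: 'coolest scene ever', score: 0 },
  { text: 'I love Arya', score: 3 }
];
const sampleWordlist = { love: 3 };
const candidates = unknownWordCounts(sampleTweets, sampleWordlist);
console.log(candidates.slice(0, 3));
```

Character names and stop words will dominate such a list, so a manual review of the top entries is still needed before anything goes into the wordlist.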

marcusnovotny commented 8 years ago

I volunteer, can you export a bunch for me?

julienschmidt commented 8 years ago

Sure, but this one is low priority. See milestones/Feature freeze for now.

marcusnovotny commented 8 years ago

Then next week!

julienschmidt commented 8 years ago

Probably we also need to use another sentiment package. The current one seems really slow.

julienschmidt commented 8 years ago

http://kt.ijs.si/data/Emoji_sentiment_ranking/

julienschmidt commented 8 years ago

We also already use this list through our current sentiment package (which we might still replace / extend): http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010

julienschmidt commented 8 years ago

don't forget https://github.com/Rostlab/JS16_ProjectD_Group4/blob/b12ba8c8a2b4838ab0488e05a8de09558eb939f1/aggregator/aggregator.js#L218

marcusnovotny commented 8 years ago

@santanumohanta can I assign this issue to you? I'm still working on #71 :/

accumen2019 commented 8 years ago

yes

marcusnovotny commented 8 years ago

Thanks!

julienschmidt commented 8 years ago

https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon another huge list

julienschmidt commented 8 years ago

any progress @santanumohanta ?

marcusnovotny commented 8 years ago

Two things I've noticed

Paxter Redwyne is still in most hated & most discussed, but the tweets seem to come from an unrelated account with the same name (fits better in #47)
Yellow Dick (never heard of the guy) is among the most hated and his graph shows a continuous red line of entries at the bottom... I guess "dick" has a negative sentiment score :smile:

@santanumohanta Might be of use for you!

julienschmidt commented 8 years ago

"yellow dick" is probably a candidate for the blacklist.

sacdallago commented 8 years ago

Why? No! I mean, yellow dick, such a wow name :laughing:

accumen2019 commented 8 years ago

@julienschmidt classified unclassified characters #47 , will provide update tonight.

accumen2019 commented 8 years ago

Based on an analysis of some tweets, below are some words that do not exist in the AFINN-111 word list and could be added manually for sentiment analysis (a probable valence is included for each):

"honorable":2, "nothing":-1, "fck":-4, "fck u":-4, "coolest":2, "lord":3, "princess":3, "boss":3, "humble":2, "smug":-2, "feast":2, "woo hoo":3, "junk":-2, "independent":2, "beast":-3, "cutest":2, "birth":1, "birthday":2, "bald":-1, "against":-2, "stand":2, "gracias":2, "limit":-2, "tough":-2, "kidnap":-2, "hang":-2, "present":2, "fairy":2, "fairy tale":2, "lengthy":-1, "finer":2, "fought":-3, "power":2, "seize":-2, "spoil":-3, "spoiler":-3, "epic":2, "lord":2, "my lord":2, "puke":-2, "pukes":-2, "fruitless":-2, "offence":-2, "RIP":-3, "fuck off":-4, "fukn":-4, "problematic":-2, "rocks":3, "wretched":-3, "beauty":3, "wise":2, "prayers":2, "prayer":2, "dangers":-3, "lose":-3, "dying":-4, "psychic":-3, "shout":-2, "shouted":-2, "sweetie":2, "dwarf":-3, "imp":-4

According to the data, more than 50% of tweets are in English and almost 30% are in Spanish. Because our AFINN-111 word list contains only English words, tweets in any other language mostly get a sentiment score of 0.

Tweet analysis is still in progress, more words to come shortly...
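
Before a list like this goes into the wordlist, it may be worth checking it for repeated words and out-of-range scores, since a JSON object silently keeps only the last value for a duplicated key. A rough sketch with a hypothetical `checkCandidates` helper; the entries are a small excerpt from the list above, and the -5 to 5 range is the one AFINN uses:

```javascript
// Hypothetical helper: reports words listed more than once with different
// scores, and scores that are not integers in the AFINN range of -5..5.
function checkCandidates(entries) {
  const seen = new Map();
  const problems = [];
  for (const [word, valence] of entries) {
    if (!Number.isInteger(valence) || valence < -5 || valence > 5) {
      problems.push(`${word}: valence ${valence} is not an integer in -5..5`);
    }
    if (seen.has(word) && seen.get(word) !== valence) {
      problems.push(`${word}: listed as both ${seen.get(word)} and ${valence}`);
    }
    seen.set(word, valence);
  }
  return problems;
}

// Excerpt from the proposed list, kept as pairs so duplicates survive.
const proposed = [
  ['honorable', 2], ['lord', 3], ['smug', -2], ['lord', 2], ['imp', -4]
];
const problems = checkCandidates(proposed);
console.log(problems);
```

On this excerpt the check flags "lord", which the list proposes once with 3 and once with 2.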

julienschmidt commented 8 years ago

Please submit those as PRs. You can add this list to defaults.json.

And probably we should filter non-English tweets in the aggregation / analysis for now.
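
Filtering by language could be as small as the following sketch. It assumes each stored tweet keeps a `lang` code such as `en` or `es` (Twitter's tweet objects include one); the sample array and the helper name are hypothetical, not the aggregator's actual code:

```javascript
// Hypothetical filter: keep only tweets whose language code is English.
function englishOnly(tweets) {
  return tweets.filter((tweet) => tweet.lang === 'en');
}

// Made-up sample data, for illustration only.
const mixedTweets = [
  { text: 'Jon Snow rocks', lang: 'en' },
  { text: 'Que gran episodio', lang: 'es' },
  { text: '#GoT', lang: 'und' }
];
const english = englishOnly(mixedTweets);
console.log(english.length);
```

Tweets tagged `und` (undetermined) are dropped here as well; whether that is desirable depends on how many of them are actually English.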

sacdallago commented 8 years ago

@santanumohanta

more than 50% of tweets are in English and almost 30% of tweets are in Spanish.

sacdallago commented 8 years ago

P.S.: @gyachdav some more interesting facts in the last paragraph of @santanumohanta 's message (CC @marcusnovotny )

gyachdav commented 8 years ago

mucho cool! can we get a complete break down by languages? would be nice to see how many Urdu tweets we processed.

accumen2019 commented 8 years ago

@gyachdav in total we have tweets in 43 different languages.

Below are the 20 languages with the most tweets:

English 56%
Spanish 28%
Indonesian 3%
Portuguese 2.8%
Turkish 1.1%
French 1%
Italian 1%
Welsh 0.70%
Tagalog 0.50%
German 0.40%
Finnish 0.40%
Estonian 0.40%
Romanian 0.20%
Icelandic 0.20%
Swedish 0.18%
Japanese 0.15%
Norwegian 0.14%
Dutch 0.11%
Polish 0.11%
Danish 0.08%

julienschmidt commented 8 years ago

Marami sa Tagalog at pagkatapos ay sa German ("A lot in Tagalog and then in German") :laughing:

How did you measure that @santanumohanta ?

accumen2019 commented 8 years ago

"en", "fr", "es", "pt", "sv", "no", "nl", "tr", "de", "in", "ht", "und", "ro", "pl", "th", "cy", "da", "sl", "et", "it", "hi", "fi", "tl", "fa", "cs", "ja", "bg", "ru", "is", "eu", "zh", "hu", "lv", "lt", "ar", "iw", "el", "vi", "ko", "uk", "sr", "ta", "ka"

These are all the languages we have; after that I calculated the percentage per language!
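
A per-language breakdown of this kind can be produced by tallying the language code of every tweet. A minimal sketch, assuming each tweet object carries a `lang` field; `countByLang` is a hypothetical name and the sample data is made up:

```javascript
// Hypothetical helper: counts tweets per language code.
function countByLang(tweets) {
  const counts = {};
  for (const tweet of tweets) {
    const lang = tweet.lang || 'und'; // treat a missing code as undetermined
    counts[lang] = (counts[lang] || 0) + 1;
  }
  return counts;
}

const langSample = [{ lang: 'en' }, { lang: 'es' }, { lang: 'en' }, {}];
const langCounts = countByLang(langSample);
console.log(langCounts);
```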

julienschmidt commented 8 years ago

And where did you get the base data per language from?

julienschmidt commented 8 years ago

My current DB is not so big, but here is a quick 'n dirty way: https://github.com/Rostlab/JS16_ProjectD_Group4/compare/lang

{ total: 882070,
  positive: 173158,
  negative: 143303,
  lang:
   [ en: 626141,
     pl: 739,
     cy: 6488,
     es: 137483,
     fr: 9409,
     in: 11423,
     und: 9090,
     lt: 184,
     tr: 9212,
     pt: 19007,
     it: 7003,
     nl: 1304,
     sv: 1258,
     fi: 1565,
     ht: 1999,
     tl: 3955,
     ro: 1237,
     da: 505,
     et: 1858,
     zh: 122,
     de: 26870,
     is: 928,
     no: 955,
     ja: 561,
     cs: 471,
     eu: 464,
     ru: 198,
     sl: 96,
     el: 141,
     hu: 544,
     lv: 52,
     vi: 35,
     ar: 155,
     th: 119,
     hi: 381,
     bg: 22,
     uk: 4,
     fa: 30,
     iw: 2,
     ko: 33,
     ka: 8,
     sr: 14,
     ta: 4,
     ne: 1 ] }

It shows me very different data
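
To compare these counts with the percentages reported earlier, each count can be divided by the total. A small sketch using a few of the numbers above; `share` is a hypothetical helper:

```javascript
// Hypothetical helper: percentage of `count` in `total`, rounded to one decimal.
function share(count, total) {
  return Math.round((count / total) * 1000) / 10;
}

// Values copied from the output above.
const total = 882070;
const counts = { en: 626141, es: 137483, de: 26870, tl: 3955 };
const shares = {};
for (const [lang, count] of Object.entries(counts)) {
  shares[lang] = share(count, total);
}
console.log(shares);
```

With these counts English comes out near 71% and Spanish near 16%, against the 56% and 28% reported earlier, and German (about 3%) ranks well ahead of Tagalog (about 0.4%).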