julienschmidt closed this 8 years ago
I volunteer, can you export a bunch for me?
Sure, but this one is low priority. See milestones/Feature freeze for now.
Then next week!
Probably we also need to use another sentiment package. The current one seems really slow.
We also already use this list through our current sentiment package (which we might still replace / extend): http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010
@santanumohanta can I assign this issue to you? I'm still working on #71 :/
yes
Thanks!
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon another huge list
any progress @santanumohanta ?
Two things I've noticed
@santanumohanta Might be of use for you!
"yellow dick" is probably a candidate for the blacklist.
Why? No! I mean, yellow dick, such a wow name :laughing:
@julienschmidt I classified the unclassified characters (#47) and will provide an update tonight.
Based on an analysis of some tweets, here are some words which do not exist in the AFINN-111 word list and could be added manually for sentiment analysis (a probable valence is given for each):
"honorable":2, "nothing":-1, "fck":-4, "fck u":-4, "coolest":2, "lord":3, "princess":3, "boss":3, "humble":2, "smug":-2, "feast":2, "woo hoo":3, "junk":-2, "independent":2, "beast":-3, "cutest":2, "birth":1, "birthday":2, "bald":-1, "against":-2, "stand":2, "gracias":2, "limit":-2, "tough":-2, "kidnap":-2, "hang":-2, "present":2, "fairy":2, "fairy tale":2, "lengthy":-1, "finer":2, "fought":-3, "power":2, "seize":-2, "spoil":-3, "spoiler":-3, "epic":2, "lord":2, "my lord":2, "puke":-2, "pukes":-2, "fruitless":-2, "offence":-2, "RIP":-3, "fuck off":-4, "fukn":-4, "problematic":-2, "rocks":3, "wretched":-3, "beauty":3, "wise":2, "prayers":2, "prayer":2, "dangers":-3, "lose":-3, "dying":-4, "psychic":-3, "shout":-2, "shouted":-2, "sweetie":2, "dwarf":-3, "imp":-4
According to the data, more than 50% of tweets are in English and almost 30% are in Spanish. Since our AFINN-111 word list contains only English words, the sentiment score for tweets in any other language mostly comes out as 0.
The tweet analysis is still in progress, more words to come shortly...
Please submit those as PRs. You can add this list to defaults.json.
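For anyone preparing that PR, here is a minimal sketch of how extra words could be layered on top of the base list and used for scoring. The real layout of defaults.json and the API of our sentiment package are not shown in this thread, so `baseList`, `extraWords` and `scoreText` are hypothetical names, and the bigram lookup is only one possible way to handle two-word entries such as "woo hoo".

```javascript
// Hypothetical stand-in for a few AFINN-111 entries (word -> valence).
const baseList = { good: 3, bad: -3, love: 3 };

// A few of the additions proposed above; the real file layout is an assumption.
const extraWords = { coolest: 2, epic: 2, junk: -2, "fairy tale": 2, "woo hoo": 3 };

// Entries from later sources override earlier ones.
const wordList = Object.assign({}, baseList, extraWords);

function hasEntry(list, key) {
  return Object.prototype.hasOwnProperty.call(list, key);
}

// Sum the valences of all known words; two-word phrases are tried first.
function scoreText(text, list) {
  const tokens = text.toLowerCase().match(/[a-z']+/g) || [];
  let score = 0;
  for (let i = 0; i < tokens.length; i++) {
    const bigram = i + 1 < tokens.length ? tokens[i] + " " + tokens[i + 1] : null;
    if (bigram !== null && hasEntry(list, bigram)) {
      score += list[bigram];
      i++; // skip the second word of the matched phrase
    } else if (hasEntry(list, tokens[i])) {
      score += list[tokens[i]];
    }
  }
  return score;
}

console.log(scoreText("Woo hoo, that was epic", wordList));
console.log(scoreText("what junk", baseList), scoreText("what junk", wordList));
```

The second call shows the point of the additions: a tweet that scores 0 against the base list gets a usable score once the extra words are merged in.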
And probably we should filter non-English tweets in the aggregation / analysis for now.
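A rough sketch of such a filter, assuming each stored tweet still carries a `lang` code like the ones listed further down in this thread. `storedTweets` and `filterByLang` are made-up names for illustration, not part of our code base.

```javascript
// Hypothetical sample of stored tweets; only the `lang` field matters here.
const storedTweets = [
  { text: "Winter is coming", lang: "en" },
  { text: "Se acerca el invierno", lang: "es" },
  { text: "Valar morghulis", lang: "und" },
  { text: "The north remembers", lang: "en" },
];

// Keep only tweets whose language code is in the allowed list.
function filterByLang(tweets, allowed) {
  const allowedSet = new Set(allowed);
  return tweets.filter((tweet) => allowedSet.has(tweet.lang));
}

const englishTweets = filterByLang(storedTweets, ["en"]);
console.log(englishTweets.length + " of " + storedTweets.length + " tweets kept");
```

Taking a list of allowed codes instead of hard-coding "en" would make it easy to let Spanish back in once we have a word list for it.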
@santanumohanta
more than 50% of tweets are in English and almost 30% of tweets are in Spanish.
P.S.: @gyachdav some more interesting facts in the last paragraph of @santanumohanta 's message (CC @marcusnovotny )
mucho cool! Can we get a complete breakdown by language? It would be nice to see how many Urdu tweets we processed.
@gyachdav we have tweets in 43 different languages in total.
Below are the 20 languages most tweets belong to:
English 56%
Spanish 28%
Indonesian 3%
Portuguese 2.8%
Turkish 1.1%
French 1%
Italian 1%
Welsh 0.70%
Tagalog 0.50%
German 0.40%
Finnish 0.40%
Estonian 0.40%
Romanian 0.20%
Icelandic 0.20%
Swedish 0.18%
Japanese 0.15%
Norwegian 0.14%
Dutch 0.11%
Polish 0.11%
Danish 0.08%
Marami sa Tagalog at pagkatapos ay sa German ("A lot in Tagalog, and then in German") :laughing:
How did you measure that @santanumohanta ?
"en", "fr", "es", "pt", "sv", "no", "nl", "tr", "de", "in", "ht", "und", "ro", "pl", "th", "cy", "da", "sl", "et", "it", "hi", "fi", "tl", "fa", "cs", "ja", "bg", "ru", "is", "eu", "zh", "hu", "lv", "lt", "ar", "iw", "el", "vi", "ko", "uk", "sr", "ta", "ka"
These are all the languages that we have; after that I calculated the percentage per language!
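The approach described here (count tweets per language code, then divide by the total) can be sketched roughly as follows. This is not the script that was actually used; `sampleTweets`, `countByLang` and `toPercentages` are hypothetical names, and the sketch assumes every tweet has a `lang` field.

```javascript
// Hypothetical sample; real tweets would come from the database.
const sampleTweets = [
  { text: "Winter is coming", lang: "en" },
  { text: "Se acerca el invierno", lang: "es" },
  { text: "The north remembers", lang: "en" },
  { text: "???", lang: "und" },
];

// Count tweets per language code; a missing code is treated as "und".
function countByLang(tweets) {
  const counts = {};
  for (const tweet of tweets) {
    const lang = tweet.lang || "und";
    counts[lang] = (counts[lang] || 0) + 1;
  }
  return counts;
}

// Convert absolute counts into percentages of the overall total.
function toPercentages(counts) {
  const total = Object.values(counts).reduce((sum, n) => sum + n, 0);
  const shares = {};
  for (const [lang, n] of Object.entries(counts)) {
    shares[lang] = total === 0 ? 0 : (100 * n) / total;
  }
  return shares;
}

const langShares = toPercentages(countByLang(sampleTweets));
console.log(langShares);
```

The percentages depend entirely on which tweets are in the database being counted, so two people running this on different data sets will get different breakdowns.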
And where did you get the base data per language from?
My current DB is not so big, but here is a quick 'n dirty way: https://github.com/Rostlab/JS16_ProjectD_Group4/compare/lang
{ total: 882070,
positive: 173158,
negative: 143303,
lang:
[ en: 626141,
pl: 739,
cy: 6488,
es: 137483,
fr: 9409,
in: 11423,
und: 9090,
lt: 184,
tr: 9212,
pt: 19007,
it: 7003,
nl: 1304,
sv: 1258,
fi: 1565,
ht: 1999,
tl: 3955,
ro: 1237,
da: 505,
et: 1858,
zh: 122,
de: 26870,
is: 928,
no: 955,
ja: 561,
cs: 471,
eu: 464,
ru: 198,
sl: 96,
el: 141,
hu: 544,
lv: 52,
vi: 35,
ar: 155,
th: 119,
hi: 381,
bg: 22,
uk: 4,
fa: 30,
iw: 2,
ko: 33,
ka: 8,
sr: 14,
ta: 4,
ne: 1 ] }
It shows me very different data
The percentage of tweets with a usable Sentiment score seems rather low. Someone should analyze some of the tweets with score=0 and look for words which we can manually add to the wordlist.
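To put a number on that, here is a small sketch using the totals from the dump above. It assumes that `positive` and `negative` count the tweets with a non-zero score and that every other tweet scored 0, which the dump itself does not state. `scoredShare` is a hypothetical helper.

```javascript
// Totals copied from the dump above.
const stats = { total: 882070, positive: 173158, negative: 143303 };

// Fraction of tweets with a non-zero score, under the assumption stated above.
function scoredShare({ total, positive, negative }) {
  return total === 0 ? 0 : (positive + negative) / total;
}

const usableShare = scoredShare(stats);
console.log((usableShare * 100).toFixed(1) + "% of tweets have a non-zero score");
```

Under that assumption, roughly a third of all tweets get a score and the rest come out as 0.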