Stefanos-stk / Bertmoticon

Multilingual Emoticon Prediction of Tweets about COVID-19😷

Blog #2

Closed Stefanos-stk closed 2 years ago

Stefanos-stk commented 4 years ago

-5/25/2020 Created the repo for the Covid19Twitter project

Stefanos-stk commented 4 years ago

-5/26/2020 Read the documentation for the NVIDIA sentiment-discovery codebase. Encountered multiple errors caused by PyTorch version incompatibilities with Python. Started running some test.csv files through run_classifier.

Stefanos-stk commented 4 years ago

5/27/2020 Read the NVIDIA paper behind this codebase. Found it very interesting that, with so little labeled data, it was able to outperform other algorithms trained on millions of examples.

The conferences I am interested in are the following:

- http://www.netcopia.net/nlp4if/ : NLP4IF (NLP for freedom)
- https://healthlanguageprocessing.org/smm4h/ : Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task 2020
- https://www.pheme.eu/rdsm2020/ : 3rd International Workshop on Rumours and Deception in Social Media

Stefanos-stk commented 4 years ago

-5/28/2020 Gathered tweets in order to test the NVIDIA sentiment analysis. The keywords I searched for were the following:

hashtags = ['the corona virus','coronavirus','covid_19','covid19','seasonal_flu','sars-cov-2','orthocoronavirinae','conjunctivitis','coronaviridae','gripe', 'gripa', 'influenza','流行性感冒','冠狀病毒亞科', '2019年冠狀病毒疾病', '疫情', '大流行', '冠肺炎', '코로나바이러스', '코로나19', '인플루엔자', '유행성감기', 'インフルエンザ', 'コロナウイルス']

These keywords cover different ways people talk about the coronavirus around the world; I collected them mainly from Wikipedia (switching languages), Twitter in different languages, and news websites.

The code for traversing the tweets and looking for the keywords is the following:

import csv
import json

# filenames below are placeholders for the actual tweet dump and output file
hits = 0
with open('tweets.jsonl') as f, open('matches.csv', 'w') as out:
    thewriter = csv.DictWriter(out, fieldnames=['Tweets', 'Hashtags', 'json'])
    thewriter.writeheader()
    for line in f:
        # load the tweet as a python dictionary
        tweet = json.loads(line)
        # convert text to lower case
        sentence = tweet['text'].lower()
        for hashtag in hashtags:
            if hashtag in sentence:
                hits += 1
                row = {'Tweets': str(sentence),
                       'Hashtags': str(hashtag),
                       'json': tweet}
                thewriter.writerow(row)
Stefanos-stk commented 4 years ago

-5/29/2020 Today's task was to create a generic function that streamlines getting the 8-emotion sentiment scores for a list of texts, where the user chooses between the transformer and mLSTM models.

def get_sentiment(text, model_used):
    # write the input texts to a temporary csv the classifier can read
    mf.make(text)
    model_file_name = '--load=' + model_used + '_semeval.clf'
    print(model_file_name)
    argv = ['--data=temp.csv', '--text-key=tweets', model_file_name, '--write-results=output.csv']
    (train_data, val_data, test_data), tokenizer, args = get_data_and_args(argv=argv)

    if model_used == 'mlstm':
        args.model = model_used
        # build the model only once and cache it for later calls
        if get_sentiment_model['mlstm'] is None:
            get_sentiment_model['mlstm'] = get_model(args)
        ypred, yprob, ystd = classify(get_sentiment_model['mlstm'], train_data, args)

    if model_used == 'transformer':
        args.model = model_used
        if get_sentiment_model['transformer'] is None:
            get_sentiment_model['transformer'] = get_model(args)
        ypred, yprob, ystd = classify(get_sentiment_model['transformer'], train_data, args)

    print(args.classes)
    print("yprob", yprob)
    return yprob

I tested this function by running it inside the main run_classifier() method. It requires an auxiliary make_file.py file:

def make(texts):
    # write the input texts to a temporary csv with a 'tweets' header
    with open("temp.csv", 'w') as temp:
        temp.write('tweets' + "\n")
        for r in texts:
            temp.write(r + "\n")

This transforms the text input to the get_sentiment function into a temporary csv file.

An initial problem I encountered was that I could not override the args values: the run_classifier script always fell back to its set_defaults values because no arguments are given after python3 run_classifier. The solution was to pass the desired parameters in through the function:

def get_data_and_args(argv=None):

This way the file could run by itself without getting disrupted by the get_sentiment function.
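
To illustrate the pattern (a minimal sketch, not the actual NVIDIA get_data_and_args, which also builds the datasets and tokenizer): argparse falls back to sys.argv when argv is None, so the script still works stand-alone, while get_sentiment can pass an explicit list that overrides the defaults. The flags shown are just the subset used above.

import argparse

def get_data_and_args(argv=None):
    # illustrative subset of the flags used above
    parser = argparse.ArgumentParser()
    parser.add_argument('--data', default='data.csv')
    parser.add_argument('--text-key', default='sentence')
    parser.add_argument('--load', default=None)
    parser.add_argument('--write-results', default=None)
    # argv=None -> argparse reads sys.argv[1:]; an explicit list overrides it
    return parser.parse_args(argv)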

Another note: to save a lot of time by not rebuilding a model every time we call `get_sentiment(text, model_used)`, we create a dictionary before invoking get_model(args):

get_sentiment_model = {
    'mlstm': None,
    'transformer': None
}

Where we can store instances of the models.

Example: Input:

get_sentiment(['I AM ON A HIGHWAY TO HELL','I dont hate it I love it ! '],'transformer')

Output:

0.308 seconds to transform 2 examples
['anger', 'anticipation', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'trust']
yprob [[ 0.07608929  0.02504291  0.12360426  0.03401234  0.2915366   0.03727515
   0.00115355  0.02476798]
 [ 0.1120083   0.04536616  0.13510057  0.05286434  0.08265216  0.07356632
   0.00169791  0.05606586]]
mikeizbicki commented 4 years ago

A similar project to be aware of and double check that we get similar results: https://github.com/delvinso/covid19_one_hundred_million_unique_tweets

Stefanos-stk commented 4 years ago

-6/2/2020 Created graphs representing the data collected. Initially, I familiarized myself with the psql bot in order to get the data needed for the plots. Then, I figured the best way to make the graphs was the Spyder editor provided by Anaconda Navigator (the one I used for my math homework). I had some trouble understanding how matplotlib works, but I managed to generate the graphs included below. The graph showing the volume of tweets matched the one from the "one hundred million unique tweets" GitHub project pretty accurately, without even using 100 million tweets as data. In addition, I made a pie chart of the different country codes. Compared with the 100-million-tweet project it is only roughly accurate (US share: 30.5% mine vs. 55.5% theirs); I believe that given enough tweets we should approach that percentage. (Figures attached: number1, number2, pieChart.)

mikeizbicki commented 4 years ago

Great progress. And there's about 500 million tweets in the dataset we're looking at. There's not that many with #coronavirus, but there's lots of other ways to talk about the coronavirus than just that one hashtag.

Stefanos-stk commented 4 years ago

I created some more graphs, revising the ones I made yesterday (figures attached: #coronavirus tweets, #countrycode).

In addition, I started creating a heatmap using the tweet count from each country. My goal with the heatmap is to display 3-4 consecutive pictures showing the "explosion" of tweets at certain dates.

I also found important dates related to the coronavirus, such as:

- W.H.O. naming it a pandemic
- President Trump shutting down the borders to Europe
- One of the first instances of the coronavirus in news outlets

I am going to include some of those events in the graphs & see how different hashtags come about.

I am having an issue with collecting some data. I typed inside the psql bot:

novichenkobot=> SELECT
novichenkobot->     date_trunc('day',created_at) as day,
novichenkobot->     count(*) as count
novichenkobot-> FROM twitter.tweet_tags
novichenkobot-> INNER JOIN twitter.tweets ON tweet_tags.id_tweets = tweets.id_tweets
novichenkobot-> WHERE
novichenkobot->     lower(tag) = '#covid19' 
novichenkobot-> GROUP BY day
novichenkobot-> ORDER BY day DESC;
^CCancel request sent
ERROR:  canceling statement due to user request
Time: 385419.117 ms (06:25.419)

It was working fine about 8 hours ago; however, right now it does not seem to be responding. Also, the day with the lowest #coronavirus tweet count is 2020-03-18, with 4,396 tweets. I am going to fix the issues you mentioned with the labels.

All in all, the plan for the graphs is:

- heatmaps
- hashtag trends with important events
- maybe a graph correlating tweet counts with the number of confirmed cases

mikeizbicki commented 4 years ago

This query will stop all of your running db processes

select pid,query,pg_cancel_backend(pid) from pg_stat_activity where usename='ssaa2018' and query not like '%pg_cancel_backend%';
Stefanos-stk commented 4 years ago

Quick Question: I am trying to gather tweets with similar hashtags:

SELECT
    date_trunc('day',created_at) as day,
    count(*) as count
FROM twitter.tweet_tags
INNER JOIN twitter.tweets ON tweet_tags.id_tweets = tweets.id_tweets
WHERE
    lower(tag)= '#covid19' OR lower(tag)= '#covid_19' OR lower(tag) = '#covid-19',
GROUP BY day
ORDER BY day DESC;
    ''')
res = connection.execute(sql,{
   # 'tag':'
    })

But I am getting this error:

sqlalchemy.exc.ProgrammingError: (psycopg2.errors.SyntaxError) syntax error at or near ","
LINE 8: ...d19' OR lower(tag)= '#covid_19' OR lower(tag) = '#covid-19',
mikeizbicki commented 4 years ago

Replace

    lower(tag)= '#covid19' OR lower(tag)= '#covid_19' OR lower(tag) = '#covid-19',

with

    lower(tag)= '#covid19' OR lower(tag)= '#covid_19' OR lower(tag) = '#covid-19'

There should be no , at the end of the line.

Stefanos-stk commented 4 years ago

Another question: I am trying to split the data I get by country in order to make the 3 heatmaps. I tried this (I added the date_trunc function):

SELECT 
    date_trunc('day',created_at) as day,
    country_code,
    count(*) as count
FROM twitter.tweet_tags
INNER JOIN twitter.tweets ON tweet_tags.id_tweets = tweets.id_tweets
WHERE
    tag = '#coronavirus'
GROUP BY country_code
ORDER BY count DESC;

But I got an error:

ERROR:  column "tweets.created_at" must appear in the GROUP BY clause or be used in an aggregate function
LINE 2:     date_trunc('day',created_at) as day,
Stefanos-stk commented 4 years ago

I started experimenting with heatmaps of the world in order to portray the number of tweets each country made across a timeline. I used the geopandas library and its default map. I had to do some conversion in Excel because the data gathered by the psql bot use ISO 2-letter country codes, whereas geopandas needs ISO 3-letter country codes. In addition, there are some gaps in the library's country data that I had to hardcode. (Example figure attached: plot.)
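
As a side note, the ISO-2 to ISO-3 conversion can also be scripted rather than done in Excel. A minimal sketch using the country_converter library (the same one used in a later entry); the file and column names here are hypothetical:

import country_converter as coco
import pandas as pd

# assumed layout: a 'Countrycode' column containing 2-letter codes
df = pd.read_csv('country.csv')
df['Countrycode'] = coco.convert(names=list(df['Countrycode']), to='ISO3')
df.to_csv('country_iso3.csv', index=False)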

I also gathered some additional tweet counts that I am going to include in the graph:

- #trump
- #stayhome, #stayathome, #stayinside
- #covid19, #covid-19, #covid_19
- #coronavirus, #corona
- #lockdown

mikeizbicki commented 4 years ago

Looks good.

Also, this command should do what I think you wanted:

SELECT 
    date_trunc('day',created_at) as day,
    country_code,
    count(*) as count
FROM twitter.tweet_tags
INNER JOIN twitter.tweets ON tweet_tags.id_tweets = tweets.id_tweets
WHERE
    tag = '#coronavirus'
GROUP BY day,country_code
ORDER BY day DESC,country_code DESC;
Stefanos-stk commented 4 years ago

Today, for some reason, I am not able to use the vscode server again. I am getting the following error when I try to re-install the vscode files. I cleaned the bash history, but that did not work. I am not sure how to clean my cache, nor whether I have permission to do that.

mkdir: cannot create directory ‘.vscode-server’: Disk quota exceeded

I am also getting this error when trying to exit the psql bot:

could not save history to file "/home/ssaa2018/.psql_history": Disk quota exceeded
Stefanos-stk commented 4 years ago

6/4/2020 Today I created one of the final versions of the graphs using matplotlib. I downloaded data for #coronavirus and for other related hashtags: covid19, lockdown, stayhome.

In addition, I added important key dates, such as the WHO declaring covid19 a pandemic. In today's call I also got a lot of useful feedback on what I should consider adding to the graph in order to make it more informative, such as other hashtags: #wuhan, #chinesevirus, etc.

Also, I am planning to find more specific hashtag trends that happened in the USA, including the state they were sent from. (Figure attached: graph22.)

The code for this graph is the following:

"""

@author: Stefanos Stoikos
"""

#import libraries
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

#corona
coronatweets = pd.read_csv('datatweetcount/#coronavirusdata.csv', usecols=['day','count'], parse_dates=['day'])
coronatweets.set_index('day',inplace=True)

#covid19
covid19 = pd.read_csv('datatweetcount/#covid19data.csv', usecols=['day','count'], parse_dates=['day'])
covid19.set_index('day',inplace=True)

#lockdown
lockdown = pd.read_csv('datatweetcount/#lockdowndata.csv', usecols=['day','count'], parse_dates=['day'])
lockdown.set_index('day',inplace=True)

#stayhome
stayhome = pd.read_csv('datatweetcount/#stayathomedata.csv', usecols=['day','count'], parse_dates=['day'])
stayhome.set_index('day',inplace=True)

#trump
trump = pd.read_csv('datatweetcount/#trumpdata.csv', usecols=['day','count'], parse_dates=['day'])
trump.set_index('day',inplace=True)

fig, ax = plt.subplots(figsize=(15,7))
ax.xaxis.set_major_locator(mdates.WeekdayLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))

ax.plot(coronatweets.index, coronatweets['count'],label = '#coronavirus') 
ax.plot(stayhome.index, stayhome['count'],label = '#stayhome',alpha=0.3)   
ax.plot(lockdown.index, lockdown['count'],label = '#lockdown',alpha=0.3)  
ax.plot(covid19.index, covid19['count'],label = '#covid19')  

ax.legend()
ax.xaxis.set_major_locator(mdates.DayLocator(interval = 15))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))

print(lockdown.index[82])

ax.annotate('W.H.O declares a pandemic       \n Trump announces Europe travel ban       ', (mdates.date2num(coronatweets.index[79]), coronatweets['count'][79]), xytext=(15, 15), 
            textcoords='offset points', arrowprops=dict(arrowstyle='-|>'),horizontalalignment='right')

ax.annotate('First US case reported', (mdates.date2num(coronatweets.index[130]), coronatweets['count'][130]), xytext=(15, 15), 
            textcoords='offset points', arrowprops=dict(arrowstyle='-|>'),horizontalalignment='right')

ax.annotate('Italy under lockdown', (mdates.date2num(lockdown.index[82]), lockdown['count'][79]), xytext=(15, 15), 
            textcoords='offset points', arrowprops=dict(arrowstyle='-|>'),horizontalalignment='left')
fig.autofmt_xdate()

for tick in ax.get_xticklabels():
    tick.set_rotation(45)

fig.savefig('graph2.pdf', format='pdf', dpi=1200)

Regarding the heatmap, I wrote some code to split the hashtag trends into separate files per day. This will let me find specific days of interest and easily map them with the code I have already written for plotting on the map, which is the following:

import pandas as pd 
import matplotlib.pyplot as plt 
import geopandas as gpd

data = pd.read_csv('country.csv')

world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

for_plotting = world.merge(data, left_on = 'iso_a3', right_on = 'Countrycode')

for_plotting.info() 

ax = for_plotting.dropna().plot(column='tweets', cmap='hot',
                                figsize=(15,9), scheme='quantiles',
                                k=6, legend=True)
ax.set_axis_off()
ax.get_legend().set_bbox_to_anchor((.12,.12))
# save through the parent figure (an Axes object has no savefig method)
fig = ax.get_figure()
fig.savefig('plot.png')

Finally, the code for writing the separate files for each day is the following:

import pandas as pd

a = []   # days
b = []   # counts
c = []   # country codes
prev = None

def flush(day):
    # write the accumulated rows for one day to its own csv file
    df = pd.DataFrame({'day': a, 'count': b, 'country': c})
    df.to_csv('datacount/' + str(day) + '.csv', index=False)

for row in res:
    print(row['day'], row['count'], row['country_code'])
    if prev is not None and str(row['day']) != prev:
        # a new day started: write out the previous day's rows first
        flush(prev)
        a, b, c = [], [], []
    a.append(row['day'])
    b.append(row['count'])
    c.append(row['country_code'])
    prev = str(row['day'])

# write the final day's rows as well
if a:
    flush(prev)
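
For reference, the same per-day split can be done more compactly with a pandas groupby. A hedged sketch that assumes the query result can be materialized in memory and uses the column names from the query above:

import pandas as pd

# one DataFrame with all rows from the query, then one csv per day
df = pd.DataFrame(res.fetchall(), columns=['day', 'count', 'country_code'])
for day, group in df.groupby('day'):
    group.to_csv('datacount/' + str(day) + '.csv', index=False)
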
mikeizbicki commented 4 years ago

Great work!

Make sure to also be continuously updating your latex code with these figures and a short description of them. That'll make you a lot more productive later on when writing the paper itself.

Stefanos-stk commented 4 years ago

I think I don't have permission to use the GPUs because I am getting this error:

RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location='cpu' to map your storages to the CPU.
mikeizbicki commented 4 years ago

Weren't you able to use the GPU before? I just checked and it looks to me like you still have permission.

mikeizbicki commented 4 years ago

I'm in a faculty meeting right now, so I can't meet right now.

Stefanos-stk commented 4 years ago

I tried it again just now and it is working fine. Sorry about the confusion.

Stefanos-stk commented 4 years ago

6-8-2020 Fixed the issue with the world heatmap. The problem was that countries with a tweet count of zero did not show up in the CSV output file at all; only countries with at least one tweet appeared. To fix that, I imported a list of ISO 3-letter country codes and filled in the gaps. In addition, geopandas, the library I use to generate the heatmaps, requires ISO 3-letter country codes, while the SQL query exports the data with ISO 2-letter country codes. I fixed that using the country_converter library. Here is the code for the above two issues:

import country_converter as coco
import glob
import pandas as pd

# full list of ISO 3-letter country codes, used to fill in missing countries
ds = pd.read_csv("isocodes.csv", encoding="utf-8")
isocodes = list(ds['x'])
print(isocodes)

path = "datacount/*.csv"
for fname in glob.glob(path):
    print(fname)
    df = pd.read_csv(fname)
    day = df['day']
    count = list(df['count'])
    country = list(df['country'])
    # convert the 2-letter codes coming from SQL into the 3-letter codes geopandas expects
    isofix = coco.convert(names=country, to='ISO3')
    # add every country that has no tweets with a count of zero
    for isoname in isocodes:
        if isoname not in isofix:
            isofix.append(isoname)
            count.append(0)
    dx = pd.DataFrame()
    dx['country'] = isofix
    dx['count'] = count
    dx.to_csv("new/" + day[2] + ".csv")

Secondly, I wrote a small piece of code in order to run the get_sentiment function I created (1st week) on a batch of data from the SQL database:

list_final = []
list_tweets = []
list_id = []
count = 0
for row in res:
    list_tweets.append(row['text'])
    list_id.append(row['id_tweets'])
    count += 1

results = run_classifier.get_sentiment(list_tweets, 'mlstm')

for i in range(count):
    list_final.append((list_id[i],
                       results[i][0], results[i][1], results[i][2], results[i][3],
                       results[i][4], results[i][5], results[i][6], results[i][7],
                       'mlstm'))
print(list_final)

I am not sure if this is the most efficient way of handling the data, since with large amounts of tweets we will probably hit an index-out-of-bounds error (I am using lists). However, here is an example of the code working (capped at 5 tweets):

[(1214943523499175936, 0.08518246, 0.009998031, 0.0866783, 0.014641507, 0.3289743, 0.0049021645, 0.0016003243, 0.0008433684, 'mlstm'), (1215174342012801025, 0.9710939, 0.000102594095, 0.9881555, 0.79603416, 2.1903155e-05, 0.012944943, 3.27495e-06, 3.1481832e-08, 'mlstm'), (1215251351762100225, 4.123685e-05, 0.0006226951, 0.0010469632, 1.3450021e-05, 0.9856888, 0.6450219, 0.00039895647, 0.0015942188, 'mlstm'), (1215276018379952128, 0.036344554, 0.0019209883, 0.02761112, 0.091166474, 0.026907803, 0.0019966026, 2.5076954e-06, 3.91939e-06, 'mlstm'), (1215538847238328320, 0.00015610691, 0.0611618, 0.00055008056, 0.008454597, 0.942403, 0.0010393397, 0.0068206913, 0.00660869, 'mlstm')]

It returns tuples of the form (tweet_id, anger, anticipation, disgust, fear, joy, sadness, surprise, trust, model). Tomorrow I want to load these results into the database using SQL.
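
To avoid the indexing concern above, here is a minimal sketch of the same pairing done with zip(), which stops cleanly at the shorter of the two sequences; the names mirror the snippet above:

# pair each tweet id with its row of 8 emotion probabilities
list_final = [
    (tweet_id, *(float(p) for p in probs), 'mlstm')
    for tweet_id, probs in zip(list_id, results)
]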

The time it took to compute sentiment (measured with the timeit library): 100 tweets: 6.47 s; 1000 tweets: 14.88 s.

Finally, I did some more research on which hashtags were used during the pandemic (still trying to find popular hashtags for South Korea from this website): https://getdaytrends.com/korea/trend/%EC%82%AC%ED%8D%BC%20%EC%84%AD%EC%A2%85/

    lower(tag) = '#coronavirus' OR lower(tag) = '#covid19'
    OR   lower(tag) = '#covid_19' OR  lower(tag) ='#corona' 
    OR  lower(tag) = '#chinesevirus' OR  lower(tag) = '#lockdown'
    OR lower(tag) = '#sars-cov-2' OR lower(tag) = '#coronaviridae'
    OR lower(tag) = '#conjunctivitis' OR lower(tag) = '#wuhan'
    OR tag = '#코로나 바이러스' OR tag = '#코로나 바이러스 감염병 세계적 유행' 
    OR tag = '#코로나 19' OR tag = '#폐쇄' 
    OR tag = '#コロナウイルス' OR  tag =   '#コロナウイルスパンデミック'
    OR tag = '#新冠肺炎' OR tag= '#新冠病毒'
    OR tag = '#冠状病毒病'  OR tag = '#武汉' 
    OR tag = '#coronovirius' OR lower(tag) = '#coronaviruspandemic'
    OR lower(tag) = '#coronapocalypse' OR lower(tag) = '#stayhomesavelives'
    OR lower(tag) = '#quarantinelife' OR lower(tag) = '#socialdistanacing'
    OR tag = '#कोरोनावाइरस' OR tag = '#कोविड 19'
    OR tag = '#코로나19' 
Stefanos-stk commented 4 years ago

The main issue right now is getting the data back up into the database. I have tried multiple ways and none of them seem to be working, but I think I am quite close.

I believe I have fixed the issue; all the data seems to be uploading just fine. How many tweets should I let through per iteration? I have tried it with 100 so far.

The code is the following:

def insert(zz):
    # unpack one result tuple and insert it as a row into twitter.tweet_sentiment
    id_tweets,anger,anticipation,disgust,fear,joy,sadness,surprise,trust,model = zz
    anger = float(anger)
    anticipation = float(anticipation)
    disgust = float(disgust)
    fear = float(fear)
    joy = float(joy)
    sadness = float(sadness)
    surprise = float(surprise)
    trust = float(trust)
    model = str(model)
    print(id_tweets,anger,anticipation,disgust,fear,joy,sadness,surprise,trust,model)
    connection.execute("INSERT INTO twitter.tweet_sentiment VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s);", (id_tweets,anger,anticipation,disgust,fear,joy,sadness,surprise,trust,model))

for x in list_final:
    insert(x)
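
A hedged sketch of batching the inserts, which is usually faster than one round trip per row; it assumes the same SQLAlchemy connection as above, the (id, 8 scores, model) tuples shown earlier, and SQLAlchemy 1.x behavior where a list of parameter tuples triggers an executemany:

def insert_many(rows, batch_size=1000):
    sql = ("INSERT INTO twitter.tweet_sentiment "
           "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s);")
    for i in range(0, len(rows), batch_size):
        batch = [
            (row[0],) + tuple(float(x) for x in row[1:9]) + (str(row[9]),)
            for row in rows[i:i + batch_size]
        ]
        # a list of parameter tuples lets the driver execute the whole
        # batch in a single executemany() call
        connection.execute(sql, batch)

insert_many(list_final)
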
Stefanos-stk commented 4 years ago

It seems that the while loop gets stuck occasionally and this error is displayed:

pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 155

The number after "row" varies when I re-run it.

mikeizbicki commented 4 years ago

I'm on zoom right now if you want me to take a look.


Stefanos-stk commented 4 years ago

I fixed the issue and I want to delete the previous data, I used the command and I got this:

novichenkobot=> delete from twitter.tweet_sentiment;
ERROR:  permission denied for relation tweet_sentiment
Time: 0.354 ms

Should I do 100 or 1000 per iteration?

mikeizbicki commented 4 years ago

Whoops, I forgot to give you the delete permissions to that table. It should work now.


mikeizbicki commented 4 years ago

Whatever number per iteration results in the fastest results. My guess is that the larger the number, the faster it will go.

Stefanos-stk commented 4 years ago

-6/11/2020

Today I was able to fix some issues with the csv files and the NVIDIA sentiment-discovery codebase. I was getting an error that had to do with the way I was feeding files into the codebase. I have 2 auxiliary .py files that create and delete temp.csv files in order to work around the command-line argument handling of the NVIDIA code. I had to switch to creating the temp files with pandas:

def make(list_):
    df = pd.DataFrame(list_, columns=['tweets'])
    df.to_csv('temp.csv')

This writes the csv with the same (default) settings the codebase expects. After fixing that, I dealt with some other issues, such as emptying lists so they don't run out of indices and deleting other auxiliary files created while computing the sentiment. I set the number of tweets evaluated per iteration to 1000, because that seemed to give the highest number of tweets labeled per minute. The count of rated tweets in the database right now is 32,000 (3:11 PM PST). At 3:16 AM there were 169,000 uploaded rated tweets.

Regarding the graphs, I am having some difficulty locating tweets (hashtags) from South Korea, which was greatly affected by corona. I expect that searching for keywords instead of hashtags will bring more results. Finally, I will update the hashtag count graph, because covid19 has been mentioned many times recently due to the ongoing protests.

Stefanos-stk commented 4 years ago

count graph, tags used for each category:

-covid/corona:

lower(tag) = '#coronavirus' OR lower(tag)= '#corona' OR tag = '#新冠病毒' 
    OR tag = '#코로나 바이러스' OR tag = '#कोरोनावाइरस'
    OR lower(tag) = '#koronavírus' OR lower(tag) = '#koronawirus' 
    OR lower(tag)= '#coronaviruse' OR lower(tag) = '#coronaviridae'
    OR lower(tag) = '#coronapocalypse' OR lower(tag) = '#coronaviruspandemic'
    OR lower(tag) = '#covid19' OR  lower(tag) = '#covid_19'
    OR lower(tag) = '#covid-19' OR lower(tag) = '#sars-cov-2'
    OR tag = '#코로나 19' OR tag = '#कोविड 19'
    OR tag = '#코로나19'  OR lower(tag) = '#covid'

-lockdown/stayinghome:

    lower(tag) = '#lockdown' OR tag= '#폐쇄' OR tag = '#封鎖'
    OR lower(tag) = '#confinement' OR lower(tag)='#confinamento'
    OR tag = '#लॉकडाउन' OR lower(tag) = '#stayhomesavelives'
    OR lower(tag) = '#stayhome' OR lower(tag) = '#stayinside'
    OR lower(tag) = '#stayathome' OR lower(tag) = '#quarantine'
    OR lower(tag) = '#quarantinelife' OR tag = '#संगरोध'
    OR lower(tag) = '#quarantaine' OR lower(tag) = '#quarantäne'
    OR lower(tag) = '#ausgangssperre' OR tag = '#건강 격리'
    OR tag = '#検疫'

I have also collected individual tweet counts for every hashtag, so that a bar graph can be included underneath the trend graph.

Websites for the corona timeline:

- https://www.who.int/news-room/detail/27-04-2020-who-timeline---covid-19
- https://www.nytimes.com/article/coronavirus-timeline.html
- https://www.euronews.com/2020/04/02/coronavirus-in-europe-spain-s-death-toll-hits-10-000-after-record-950-new-deaths-in-24-hou

I am including some of the final graphs (figures attached): heatmaps for 2020-03-14 and 2020-03-15, and the final count graph (finalcountgraph1).

Stefanos-stk commented 4 years ago

(Figure attached: sent_1.) This is the first image that I got from the data; there is a neat split that happens between anger & disgust vs. joy. There is some initial noise in the data in January; I am pretty confident that is because on one day in the existing dataset we had no data, and on the days after that we had very few tweets. I am already letting more tweets get evaluated on GPUs 0 and 1. It has been 2 hours and I haven't received any results yet. I used this command: CUDA_VISIBLE_DEVICES=0 nohup python3 sqlSent.py &

- Could you also send me the translations?
- I checked the number of rated tweets in the SQL dataset again, but it seems to be the same number. It has been about 10 hours since I ran that command, so I am not sure what's wrong. What is the command to check the psql requests?
- For the sentiment data, do you have a particular graph in mind that I should consider? (I was thinking standard deviation.)

Number of tweets rated: 179,887

Stefanos-stk commented 4 years ago

So I have fixed the error with the SQL dataset; however, I now realize that in order to run the NVIDIA sentiment analysis code we have to downgrade PyTorch. I am getting this error: AttributeError: 'DataLoader' object has no attribute '_dataset_kind', which has to do with PyTorch. The fix, according to the threads posted under the NVIDIA repo, is: !pip install torch==1.0.1 torchvision==0.2.2. My plan is to do that a bit later, before I go to sleep. In the meantime, I am going to configure the hyperparameters of BERT. My plan is to run the remaining 1-million-ish tweets through the NVIDIA model overnight.

Update: I started running the sqlSent.py file for rating and uploading tweets to the dataset; however, after some iterations I noticed that it stopped, and this is the error I got:

psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "tweet_sentiment_pkey"
DETAIL:  Key (id_tweets)=(1243237224465846272) already exists.

[SQL: INSERT INTO twitter.tweet_sentiment VALUES (%s, %s, %s, %s,%s, %s, %s,%s,%s,%s);]
[parameters: (1243237224465846272, 0.005437786690890789, 0.006194005720317364, 0.008206545375287533, 4.213873398839496e-05, 0.9130354523658752, 0.0017584455199539661, 0.00027012662030756474, 0.0007652142667211592, 'mlstm')]
(Background on this error at: http://sqlalche.me/e/gkpj)
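
One possible way to make the upload idempotent so a re-run does not crash on the primary key (a sketch; ON CONFLICT is standard PostgreSQL, and id_tweets is the key named in the error above):

sql = ("INSERT INTO twitter.tweet_sentiment "
       "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s) "
       "ON CONFLICT (id_tweets) DO NOTHING;")
# skips rows whose id_tweets already exists instead of raising UniqueViolation
connection.execute(sql, (id_tweets, anger, anticipation, disgust, fear,
                         joy, sadness, surprise, trust, model))
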
Stefanos-stk commented 4 years ago

https://tensorboard.dev/experiment/a24FM997S7CLtDboCbIJ2g/#scalars&_smoothingWeight=0.999&runSelectionState=eyJtb2RlbD1iZXJ0X2hpZGRlbj0xMjhfbGF5ZXJzPTFfY29uZD1GYWxzZV9yZXNuZXQ9RmFsc2VfbHI9NWUtMDVfb3B0aW09YWRhbV9jbGlwPVRydWVfMjAyMC0wNi0yNSAxMToyNjowMC45MTg2OTMiOnRydWV9

I ran a lot of different variations of the model. I played around with some other model options such as --resnet, --hidden_layer_size, --num_layers, and --batch_size, and some of them seemed to help the model perform better. I have uploaded the better results of my findings.

I found an article about fine-tuning: https://medium.com/@prakashakshay90/fine-tuning-bert-model-using-pytorch-f34148d58a37 It suggests the following ranges:

- Batch size: 16, 32
- Learning rate (Adam): 5e-5, 3e-5, 2e-5
- Number of epochs: 2, 3, 4

I tried out a few combinations from these. One question I have is whether the number of epochs is the same as num_layers in the code? I also noticed that you recommended running --warm_start. Is that with the model in the models file you provided? Best accuracy achieved so far: @1: 0.33, @20: 0.94.

Stefanos-stk commented 4 years ago

I am dealing with a disk quota issue again. I can neither make directories nor log in via vscode (which basically requires making a directory). I deleted a lot of files, about 2 GB worth, from the cache and the pretrained NVIDIA dataset, since we are not using it anymore. I am checking to see if there is anything else I can delete, because I am pretty sure that even with the addition of the train/test/val sets I should have plenty of space to work with.

I am getting this for df -i:

Filesystem        Inodes   IUsed     IFree IUse% Mounted on
udev            32961483     831  32960652    1% /dev
tmpfs           32971051    1626  32969425    1% /run
/dev/nvme0n1p2 117178368  843821 116334547    1% /
tmpfs           32971051       2  32971049    1% /dev/shm
tmpfs           32971051       4  32971047    1% /run/lock
tmpfs           32971051      18  32971033    1% /sys/fs/cgroup
/dev/nvme0n1p1         0       0         0     - /boot/efi
/dev/sda1      793501696 3174338 790327358    1% /data
tmpfs           32971051      10  32971041    1% /run/user/1003
tmpfs           32971051      45  32971006    1% /run/user/1063
tmpfs           32971051      10  32971041    1% /run/user/1000

and also:


ssaa2018@lambda-server:~$ quota
Disk quotas for user ssaa2018 (uid 1063):
     Filesystem  blocks   quota   limit   grace   files   quota   limit   grace
 /dev/nvme0n1p2 18014104* 12582912 16777216   5days   34944       0       0     
mikeizbicki commented 4 years ago

I've given you an extra 30 gb of disk space.

One thing to watch out for is that all the models that you're training take up a lot of space (probably ~500mb each).


Stefanos-stk commented 4 years ago

I am getting this error when trying to train the model: json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 350 (char 349). I made all the adjustments you talked about in the email. I am guessing it has something to do with the way I am loading the JSON files. I have been stuck on this problem for a while, but hopefully I should be able to fix it. I don't know if it matters, but I am making the code skip the first line because it contains the string "Timing is on"; I am not sure whether skipping it creates an issue.

Also, I have deleted the previous models.

Stefanos-stk commented 4 years ago

Here are some notes that I took from reading the following papers:

--Will include a citation from the emoji paper for the way we dealt with duplicates

At this time I am letting this run: python3 names_transformers.py --data data/prelim_emoticons_train.jsonl.gz --data_format headlines --model bert --batch_size 128 --learning_rate 2e-5 --gradient_clipping --optimizer adam --train. It seems that it takes some time for the data to load. (GPU=0)

Tomorrow I am planning on reading 2 more of the papers and running the prelim models. Finally, I am going to make some small additions to the latex file. I haven't pushed anything so far.

Stefanos-stk commented 4 years ago

kant2018practical: They claim that large-scale unsupervised language modeling combined with finetuning offers a practical solution to multi-emotion sentiment classification, including cases with label class imbalance and domain-specific context.

- Train an attention-based transformer network on 40 GB of Amazon data.
- Transformers outperform the mLSTM model.
- Finetuning significantly improves performance for both mLSTM & transformers.
- They expose the model to many contexts.

mohammad2015sentiment: Stance towards a target can be expressed with negative or positive language. They provide a dataset of tweet-target pairs annotated for both stance and sentiment. Contributions:

- Created a stance dataset (4k tweets) towards Atheism, Hillary Clinton, etc.
- Built a visualizer for the stance dataset.
- Organized a shared-task competition on stance.
- Stance detection system: F-score of 70.3, using a linear-kernel SVM classifier.
- Explored some research questions.
- Word embeddings: derived 100-dimensional word vectors using the Word2Vec skip-gram model.

yang2015twitter: Measuring emotional contagion in social media. A recent study hypothesized that emotions spread online, even in the absence of the non-verbal cues typical of in-person interactions, and that individuals are more likely to adopt positive or negative emotions if these are over-expressed in their social network.

- They raise some ethical concerns, since such studies require massive-scale content manipulation with unknown consequences for the individuals involved.
- They devise a null model that discounts some confounding factors.
- They identify two different classes of individuals: highly and scarcely susceptible to emotional contagion.
- They mention the limitations of observational experiments: (i) the presence of confounding factors, including network effects like homophily and latent homophily, may affect the size of the observed effects; (ii) even the state of the art among sentiment analysis algorithms, like the SentiStrength tool they employ, cannot capture complex language nuances such as sarcasm or irony; (iii) finally, emotional contagion may be mixed with other emotional alignment effects, such as empathy or sympathy. They detail these issues in their Discussion section.
- They used SentiStrength to annotate the tweets with positive or negative scores; it is designed for short informal texts with abbreviations and slang.
- Their goal was to establish a relation between the sentiment of a tweet and that of the tweets its author may have seen in a short time period preceding its posting.

hemmatian2017survey: I think I have to pay a subscription to read this one.

Stefanos-stk commented 4 years ago

I am continuing to read other papers about emoji prediction:

Towards Understanding Creative Language in Tweets:

- Used a transfer-learning model based on BERT.
- They had 2 tasks: 1) irony detection, 2) emoji prediction (38.52 F-score).
- Problem: the knowledge the models gain comes entirely from the corresponding training data; such a training mechanism may work well for traditional text genres with formal sentences, but it usually achieves unsatisfactory performance on informal text, such as social media text.
- They argue that transfer learning can be used to solve the above issue.
- Emojis can carry a lot of meaning; for instance, the "pray" emoji is intended to mean pray but has also been used as a high five.
- They used the ekphrasis tool for preprocessing: it can perform tokenization, word normalization, word segmentation, and spell correction for Twitter.
- Hyperparameters: max_seq_length: 128, train_batch_size: 32, learning_rate: 2e-5, num_training_epochs: 3, number_of_labels (emoji prediction task): 20, number_of_labels (irony detection task A): 2, number_of_labels (irony detection task B): 4, pre-trained BERT model: bert-base-uncased, optimizer: BERT Adam, lower case: True.
- Data: 550K tweets (500K training, 50K testing).
- Issues: overfitting due to the lack of more training data to fine-tune the system; time complexity.

NTUA-SLP at SemEval-2018 Task 2: Predicting Emojis using RNNs with Context-aware Attention:

- Proposed architecture: LSTM with an attention mechanism; the embedding layer is initialized with word2vec.
- Trained end-to-end with back-propagation.
- They have a nice Figure 1 with a sliding-window emoji prediction example.
- 20 most common emojis.
- 550 million English tweets in total.
- ekphrasis tool for preprocessing.
- Bayesian optimization.
- F1-score: 35.361%.
- Figure 7: attention visualizations.

SemEval 2018 Task 2: Multilingual Emoji Prediction:

- Multilingual: Spanish & English.
- Only tweets with a single emoji are included in the task datasets.
- Tweets collected with the Twitter API, geolocated in the US (trial: 50K, training: 500K, test: 50K) or in Spain (trial: 10K, training: 100K, test: 10K).
- Best systems (by Tübingen-Oslo): English 35.99 F1 and Spanish 22.36 F1, both SVM classifiers with bag-of-n-grams features.

Stefanos-stk commented 4 years ago

So the error that I was getting was the following: RuntimeError: invalid argument 5: k not in range for dimension at /pytorch/aten/src/THC/generic/THCTensorTopK.cu:23

A quick fix I found was to just delete the 5, 10, and 20 accuracies from the ks list:

    # get accuracy@k
    ks = [1,5,10,20]
    k = max(ks)

I am getting some incredibly good accuracies for @1, which leads me to believe that something is wrong, probably overfitting due to the smaller dataset (1M) and the large batch size (256). Going to update this thread with a link to the accuracies in a bit: https://tensorboard.dev/experiment/5nTDMMn4Q7665Gm7pqZe1A/
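
An alternative to deleting the larger k values would be to clamp ks to the number of emoji categories, so accuracy@k is only computed where topk is valid. A minimal sketch, where n_categories stands for the number of classes the model outputs (an assumed name):

# keep only the k values that fit in the label space
ks = [k for k in [1, 5, 10, 20] if k <= n_categories]
k = max(ks)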

Stefanos-stk commented 4 years ago

I have been trying to collect the 80 emoji categories the way you collect them in the code. However, since I am running it on the train set, which is quite big, I have not been able to get them; it has been running for 4 hours now. I don't know if there is a better way of getting the 80 categories (a list from online would probably differ from our set due to geolocation restrictions). I have also changed the way I load lines into the model: I do it sequentially, one at a time (I don't load them into any list). Finally, for evaluation, the only change I have made to the code is to include an if statement inside the --train portion.

Stefanos-stk commented 4 years ago
  with gzip.open(args.data,'rt') as f:
        for (step,line) in (range(1, args.samples + 1),f):

So under the train section I am trying to iterate over two things at the same time: the lines of the JSON file and the range of samples (10k+). I am trying to combine the two so that lines are only loaded when they are needed, to save time.

        for (step,line) in (range(1, args.samples + 1),f):

            # get random training example
            categories = []
            lines = []
            for i in range(args.batch_size):
                if args.sample_strategy == 'uniform_category':
                    category = random.choice(all_categories)
                    line = random.choice(category_lines[category])
                elif args.sample_strategy == 'uniform_line':
                    line, category = random.choice(lines_category)
                elif args.sample_strategy == 'linear_choice':
                    try:
                        tweet = json.loads(line)
                    except json.decoder.JSONDecodeError:
                        continue
                    text = tweet['text']
                    text = format_line(text)  
                    line = text
                    category = emoji = tweet['emoji']

                categories.append(all_categories.index(category))
                lines.append(line)

I am getting this error: ValueError: too many values to unpack (expected 2), pointing at the line with the double for loop. Do you have any idea how to fix it? Thank you. (linear_choice is the strategy I am using.)

mikeizbicki commented 4 years ago

The emoji categories are the same for each of the datasets, so you can use the validation set to collect the categories.
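
A minimal sketch of collecting the label set from the (much smaller) validation file, assuming the same gzipped-jsonl format with an 'emoji' field used in the training code; the filename here is hypothetical:

import gzip
import json

all_categories = set()
with gzip.open('data/prelim_emoticons_val.jsonl.gz', 'rt') as f:
    for line in f:
        try:
            tweet = json.loads(line)
        except json.decoder.JSONDecodeError:
            continue  # skip any malformed lines
        all_categories.add(tweet['emoji'])
all_categories = sorted(all_categories)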


Stefanos-stk commented 4 years ago

Quick reply: I got the 80 emojis after about 4 ish hours of waiting, so I have the categories right now!

mikeizbicki commented 4 years ago

Your line:

        for (step,line) in (range(1, args.samples + 1),f):

I think should be

        for step,line in enumerate(f):

That should fix the error message.

I would also completely take out any checks for ending the training early, so ignore the samples input parameter. I introduced this in class, but you won't be using it in your training at all. I want you to always be inspecting the tensorboard output and stopping the process manually. If the process ever stops early that means we're just wasting GPU until you start a new process.

One thing you will need to do, however, is ensure that the code doesn't stop after going through the input file once. To do this, just wrap the whole thing in a while loop:

while True:
    with gzip.open(args.data,'rt') as f:
        for step,line in enumerate(f):
Stefanos-stk commented 4 years ago

I am having an issue with loading the JSON data. Before, when we loaded the JSON text into a list, we avoided the error json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0) by applying try/except blocks. Now, however, the way I am loading the text is more direct (inside the for loop over the batch size). By applying those blocks again, I am missing data and the batch list is getting filled with null entries. The easy temporary fix would be to make the for loop run until there are enough tweets in the batch. I am not sure if this is the best fix, since a lot of data is lost. Should I try different ways of loading the JSON file?
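
One way the batch could stay full without null entries is a small generator that yields only lines that parse; a hedged sketch using the gzip format and field names from the snippets above:

import gzip
import itertools
import json

def valid_tweets(path):
    # yield (text, emoji) only for lines that parse as JSON
    with gzip.open(path, 'rt') as f:
        for line in f:
            try:
                tweet = json.loads(line)
            except json.decoder.JSONDecodeError:
                continue
            yield tweet['text'], tweet['emoji']

# usage: always fills a batch of exactly batch_size valid examples (or fewer at EOF)
stream = valid_tweets(args.data)
batch = list(itertools.islice(stream, args.batch_size))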

mikeizbicki commented 4 years ago

What you've written is correct.

My memory is that there's not very many lines that have formatting errors and need the try/except block, just the first and last lines. I should have this fixed for you as well with the final version of the dataset (which will hopefully be done by Friday).

Stefanos-stk commented 4 years ago

Recap of the week: I believe I have completed almost everything I wanted to do by today, which includes:

- Filled out the Related Works area of the paper with brief summaries highlighting the important details that correlate with our project (about 1.25 pages long).
- Included an emoji-prediction related-works paragraph that contains brief summaries of 3 papers.
- Made some minor additions (explanations) to the paper.
- Planned the 5 figures for the paper.
- Committed the changes.
- Emojis now get imported rather than scanning the whole file for them (saving a lot of time). The way I dealt with this seems primitive, but I believe it works all right for now: I saved all the emojis into a txt file and just import them to build the all_categories list.
- Made some adjustments to the code so it becomes easier to use over time / faster.
- The sentences don't get loaded into a list and then into the model; they go straight into the model (when creating the batches).
- The JSON error is the one from before (json.decoder.JSONDecodeError). I am not quite sure I understood your answer correctly: will this error not show up in the final training set, or should I have try/except blocks in place?
- I think the code is ready to receive the training set.
- Is there something in the paper I can complete before running the model? (Correlation between the Plutchik wheel and emojis; I could not find any papers stemming from that image you sent me.)
- Graphs showing how many tweets we have for each language: is that something I can extract from the database?
- Citations in latex (I would really appreciate it if you showed me one example of how to fix the '???' marks when trying to cite!).
- Sorry for the long thread!

Hyperparameters that I am going to use:

- Batch_size = 64, 128, 256
- Optimizer = Adam
- Gradient_clipping = True
- Learning Rate = 1e-4, 3e-4, 3e-5
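
A minimal sketch of looping over that grid by re-invoking the training script with the flags already used above; this just runs the jobs sequentially and leaves out GPU assignment:

import itertools
import subprocess

for batch_size, lr in itertools.product([64, 128, 256], [1e-4, 3e-4, 3e-5]):
    subprocess.run([
        'python3', 'names_transformers.py',
        '--data', 'data/prelim_emoticons_train.jsonl.gz',
        '--data_format', 'headlines',
        '--model', 'bert',
        '--batch_size', str(batch_size),
        '--learning_rate', str(lr),
        '--optimizer', 'adam',
        '--gradient_clipping',
        '--train',
    ])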

mikeizbicki commented 4 years ago

I've added the final data into your project/data folder.

mikeizbicki commented 4 years ago

To address your questions:

  1. The data folder contains 3 csv files with summary statistics of the data that you can generate plots for while waiting on training.
  2. The dataset should be fixed so that the try/except blocks are no longer needed.
  3. To fix the latex ???, you have to add entries in the bibtex file. I'll show you how when we meet next.