Stefanos-stk / Bertmoticon

Multilingual Emoticon Prediction of Tweets about COVID-19😷

Sentiment analysis function #1

Closed: mikeizbicki closed this issue 3 years ago

mikeizbicki commented 4 years ago

The first task is based on the code at https://github.com/NVIDIA/sentiment-discovery/ .

Currently, this code works on the command line only, but we need to adapt it into a python function. The function should look something like:

def get_sentiment(texts: [str], model: str) -> [NamedTuple]:

where the NamedTuple has 8 floating point fields, one for each type of sentiment, and there is one of these tuples per input text entry.
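
A minimal sketch of what that interface could look like, assuming the eight Plutchik emotions as the fields (the class and field names here are illustrative, not part of the NVIDIA code):

from typing import List, NamedTuple

class Sentiment(NamedTuple):
    anger: float
    anticipation: float
    disgust: float
    fear: float
    joy: float
    sadness: float
    surprise: float
    trust: float

def get_sentiment(texts: List[str], model: str) -> List[Sentiment]:
    # one Sentiment tuple per input text; the body would be adapted
    # from the sentiment-discovery code
    raise NotImplementedError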

Eventually, we will probably want to place this function into a python package similar to https://textblob.readthedocs.io/en/dev/quickstart.html#sentiment-analysis

Once this function works, we will apply it to all the tweets we have about coronavirus to measure how the sentiment about coronavirus has changed over time/place.
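
As a very rough sketch of that downstream step (purely illustrative; get_sentiment and the Sentiment fields are from the sketch above, and the tweets.csv columns are assumptions):

import pandas as pd

tweets = pd.read_csv('tweets.csv')                     # assumed columns: date, country, text
scores = get_sentiment(list(tweets['text']), 'mlstm')  # one tuple per tweet
tweets['joy'] = [s.joy for s in scores]
print(tweets.groupby(['date', 'country'])['joy'].mean())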

Stefanos-stk commented 4 years ago

Hello Mike,

I was trying to understand the documentation of the codebase, so I downloaded the finetuned Plutchik mLSTM model, which covers the 8 Plutchik emotions (the ones we care about). I had a couple of issues with PyTorch not being compatible, but I fixed most of them and then mainly played around with this command:

python run_classifier.py --data test.csv --text-key text --write-results output.csv --load mlstm_semeval.clf

Unfortunately, I could not get results because I am running it on a laptop without the required GPU. So my questions are:

- Should I go on the server and use the code there?
- For the function that I have to build, should I use the Python scripts/files provided in this NVIDIA sentiment package?

Thank you

mikeizbicki commented 4 years ago
  1. Yes, we want it to run on the server, and your account should have access to the GPUs.

  2. Yes, the function should just access the existing code in the right way. I recommend forking the nvidia repo and then just modifying the forked code directly.

Finally, we want the function to be able to use both the mlstm model and the transformers model in the library. IIRC, the mlstm model was a bit easier to get working, but the transformers model gives better results.

Stefanos-stk commented 4 years ago

Server question: should I scp the codebase onto the server, or is there a way to run it on the server's GPU from my computer?

Stefanos-stk commented 4 years ago

I have a good amount of tweets that I want to run the model on; however, I am getting 2 errors:

ssaa2018@lambda-server:~/sentiment_discovery_nvidia/sentiment-discovery$ python3 setup.py install
Segmentation fault (core dumped)

This is the error I get when I try to run the setup.py install required by the codebase. Since it is a segfault, could it be because I placed the files in a different directory than the base one?

I tried testing the PyTorch version from the root of my home directory and I am getting the same issue:

ssaa2018@lambda-server:~$ python3 -c "import torch; print(torch.__version__)"
Segmentation fault (core dumped)

I tried clearing the cache, but I don't think I have permission to do that:

sudo sh -c "$(which echo) 3 > /proc/sys/vm/drop_caches"
[sudo] password for ssaa2018: 
ssaa2018 is not in the sudoers file.  This incident will be reported.
Traceback (most recent call last):
  File "traverse-tweets.py", line 28, in <module>
    sentence = tweet['text'].lower()                                                        
KeyError: 'text'

This is an error I get while traversing through one day's tweets in the traverse-tweets.py file. I tried reading the Twitter API documentation, but I could not find anything wrong with this implementation.
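
A more defensive version of that line might look like this (a sketch; the surrounding loop is assumed, as is standard Twitter API v1.1 JSON, where tweets collected in extended mode carry their text under 'full_text' and delete notices have no text field at all):

for tweet in tweets:
    # prefer the untruncated field, and skip objects with no usable text
    text = tweet.get('full_text') or tweet.get('text')
    if text is None:
        continue
    sentence = text.lower()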

Stefanos-stk commented 4 years ago

I got a small batch of tweets (around 17 MB of data) and tried running this:

ssaa2018@lambda-server:~/sentiment_discovery_nvidia/sentiment-discovery$ python3 run_classifier.py --data pilot.csv --text-key Tweets --write-results output.csv --load mlstm_semeval.clf
Segmentation fault (core dumped)

This causes the error above. pilot.csv is the file with the tweets, and --text-key Tweets points to the column with the text. I am not sure how to interpret this segfault.

Stefanos-stk commented 4 years ago

The generic get_sentiment function is almost ready; however, I am getting one error. This is the function:

def get_sentiment(text, model):
    mf.make(text)                                # presumably writes the input texts to temp.csv
    # load the data, tokenizer, and args the same way run_classifier.py does
    (train_data, val_data, test_data), tokenizer, args = get_data_and_args()
    args.load = model.lower() + '_semeval.clf'   # e.g. mlstm_semeval.clf
    args.model = model
    clf = get_model(args)                        # renamed to avoid shadowing the model argument
    # NOTE: args.data is set only after the data has already been loaded above
    args.data = 'temp.csv'
    args.text_key = 'key'
    print(args)
    ypred, yprob, ystd = classify(clf, train_data, args)
    print(args.classes)
    print("ypred", ypred)

The only issue I am getting is with train_data. I traced it through the other files and the functions it passes through.

In the arguments.py file under the function

def add_run_classifier_args(parser):

there is a line:

data_group.set_defaults(split='1.', data=['data/binary_sst/train.csv'])

which basically sets the default path to one of their own datasets. I tried overriding it with some other code, but it did not seem to work. The way I am testing my function is through main():

 get_sentiment(['I love life so much','I hate bad food'],'mlstm')
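
One possible workaround would be to inject the options before the data is loaded (a sketch, under the assumption that get_data_and_args() parses sys.argv the same way run_classifier.py does on the command line; mf.make and temp.csv are carried over from the snippet above):

import sys

def get_sentiment(text, model):
    mf.make(text)                      # presumably writes the input texts to temp.csv
    # set the options before get_data_and_args() builds train_data, so the
    # default data/binary_sst/train.csv from arguments.py never takes effect
    sys.argv = ['run_classifier.py',
                '--data', 'temp.csv',
                '--text-key', 'key',
                '--load', model.lower() + '_semeval.clf']
    (train_data, val_data, test_data), tokenizer, args = get_data_and_args()
    args.model = model
    clf = get_model(args)
    ypred, yprob, ystd = classify(clf, train_data, args)
    return ypred
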
Stefanos-stk commented 3 years ago

'Washington Man Is 1st in US to Catch Newly Discovered Dangerous Pneumonia. Get out your face masks folks! #coronavirus #wuhan #mask',

The languages in bold had the mask emoji:

English, Portuguese, Spanish, Japanese, Arabic, Filipino, Turkish, French, Thai, Hindi, Italian, Russian, Greek, German, Hebrew, Korean, Polish, Indonesian

It does much better with Latin-based languages (besides Hebrew).

mikeizbicki commented 3 years ago

I think the bold languages would make a great figure.