Stefanos-stk / Bertmoticon

Multilingual Emoticon Prediction of Tweets about COVID-19😷

Blog #2

Closed Stefanos-stk closed 3 years ago

Stefanos-stk commented 4 years ago

- 5/25/2020: Created the repo for the Covid19Twitter project.

Stefanos-stk commented 4 years ago

I forgot to ask you about the previous paper you wrote (for the top 10 languages). I started testing the sgd optimizer as well, and it already seems to be performing better than adam (lr = 1e-3). I also started another run with a warm start, but it is barely increasing the accuracy. I got main.bib to work; VS Code just had to reload and that fixed the issue. A question I have is about getting the accuracies for each language individually: is the best way to keep the language tag all the way through and save the predictions into separate lists to compute their accuracies/F1 scores?

Stefanos-stk commented 4 years ago

I am getting an error when trying to train inside BertFineTuning:

e.py", line 847, in load_state_dict self.__class__.__name__, "\n\t".join(error_msgs))) RuntimeError: Error(s) in loading state_dict for BertFineTuning: Missing key(s) in state_dict: "bert.embeddings.word_embeddings.weight", "bert.embeddings.position_embeddings.weight", "bert.embeddings.token_type_embeddings.weight", "bert.embeddings.LayerNorm.weight", followed by a plethora of other bert.* keys. Not sure how to fix it. I created another argparse flag to control whether to train inside or outside.
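One common workaround for this kind of missing-key error (just a sketch, not necessarily the right fix here; it assumes the checkpoint file is a plain state_dict and that model and args.warm_start come from the training script):

    import torch

    # Load the checkpoint non-strictly: keys absent from the file (the bert.* weights,
    # when the checkpoint was saved with --train_where outside) simply keep the values
    # the model already has (the pretrained BERT weights).
    checkpoint = torch.load(args.warm_start, map_location='cpu')
    missing, unexpected = model.load_state_dict(checkpoint, strict=False)
    print('missing keys:', len(missing), 'unexpected keys:', len(unexpected))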

Stefanos-stk commented 4 years ago

This is the code I have so far:


model_name = 'bert-base-multilingual-uncased'
tokenizer = transformers.BertTokenizer.from_pretrained(model_name)

# when training "outside", BERT lives outside the module, so only the final
# linear layer's parameters get optimized
if args.train_where == 'outside':
    bert = transformers.BertModel.from_pretrained(model_name)
    print('bert.config.vocab_size=', bert.config.vocab_size)

class BertFineTuning(nn.Module):
    def __init__(self):
        super().__init__()
        # when training "inside", BERT is a submodule, so its weights get fine-tuned too
        if args.train_where == 'inside':
            self.bert = transformers.BertModel.from_pretrained(model_name)
        embedding_size = args.hidden_layer_size
        self.fc_class = nn.Linear(768, len(all_categories))

    def forward(self, x):
        input_ids, attention_mask = x
        if args.train_where == 'inside':
            last_layer, embedding = self.bert(input_ids)
        else:
            last_layer, embedding = bert(input_ids)
        # mean-pool the last hidden layer over the sequence dimension
        embedding = torch.mean(last_layer, dim=1)
        out = self.fc_class(embedding)
        return out, None
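A quick usage sketch (illustrative only; it assumes args and all_categories from the real script, a transformers version with a callable tokenizer, and the tuple-style outputs the class above relies on):

    model = BertFineTuning()
    encoded = tokenizer(["I feel so tired and bored #quarantine"],
                        return_tensors='pt', padding=True)
    out, _ = model((encoded['input_ids'], encoded['attention_mask']))
    print(out.shape)   # (1, len(all_categories)) -- one logit per emoji class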
Stefanos-stk commented 4 years ago

emojitable.pdf

I had to add a 5th decimal place for the last 2 since they were all zeros until that point.

I have put the image right next to the emoji wheel. LaTeX is very finicky when placing images side by side; I managed to find scales where both images fit next to each other and look reasonably normal, but as soon as I try to add labels or captions they immediately go back to being stacked on top of each other. Not the major concern right now, just a detail I have to take care of eventually. (Going to push the changes.)

Stefanos-stk commented 4 years ago

So I am trying to debug how to make a stop flag to get the accuracies (no luck with that yet), but I noticed something interesting. It seems that we are losing about 1 line per 64 lines due to the JSON decoder error. I am printing out the step and the error count (the step is just the number):

step 3, error: 1
step 7, error: 2
step 7, error: 3
step 8, error: 4
step 8, error: 5
step 10, error: 6
step 11, error: 7
step 15, error: 8
step 17, error: 9
step 17, error: 10
step 18, error: 11
step 19, error: 12
step 23, error: 13
step 25
Stefanos-stk commented 4 years ago

Question about the F1 scores. It seems that the way I currently get the F1 scores, for every batch it picks just one sample (the best-performing one?) and appends it to the y_pred list:


                if step % args.print_every == 0 or True:   # 'or True' means this runs every step

                    # get category from output -- note that only the last
                    # element of the batch (index -1) is used here
                    top_n, top_i = output_class.topk(1)
                    guess_i = top_i[-1].item()
                    category_i = category_tensor[-1]
                    guess = all_categories[guess_i]
                    category = all_categories[category_i]

                    y_true.append(category)
                    y_pred.append(guess)

Is this correct, or should I calculate it for every element in the batch? Given that we are trying to get accuracies for languages with small representation, I assume we have to do the latter. If so, I am not sure how to get every element and its model prediction.

mikeizbicki commented 4 years ago

There was an error in how the " was being escaped in the json files. It was being written as \\" instead of \". I have fixed the problem, and renamed the fixed files to include the word fixed inside them like so: tweet_emoji_dataset_fixed_train.jsonl.gz.
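(For reference, a quick sketch of the round-trip behaviour the standard json module guarantees when the escaping is done correctly; this is illustrative, not the script that fixed the files:)

    import json

    line = 'She said "wear a mask"'
    encoded = json.dumps({'text': line})          # -> {"text": "She said \"wear a mask\""}
    assert json.loads(encoded)['text'] == line    # the quote survives the round trip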

mikeizbicki commented 4 years ago

You're correct that you can't just calculate 1 point per batch, but must do it for every point. The easy way to do this is loop over each point in the batch, and access that point by changing the -1 index to the loop counter. This way, you can also check the language and country to append to the appropriate lists as well.

mikeizbicki commented 4 years ago

Related to the fixed dataset: you should run another training pass, this time over the fixed data, warm starting from the previous runs. Also, be sure to decrease the learning rate.

Stefanos-stk commented 4 years ago

I had 2 runs going: lr 1e-06 with warm start (one with the fixed dataset and one with the old one). Both were training overnight and doing significantly better than previous runs, but it seems the GPU stopped working (I tried checking nvidia-smi) and I got an error. I think they have fallen back to the CPU; I am not sure they are still learning since the learning rate is really small. At the moment I am fixing the F1 scores: I created the loop that calculates over all the points, and I am trying to fix the per-country-code and per-language F1 scores.

Stefanos-stk commented 4 years ago

I tried initiating the dictionaries like this:


    # note: the first assignment to each name is immediately overwritten by the dict below
    langs = defaultdict(lambda: {})
    langs = {
        'true': defaultdict(lambda: []),
        'pred': defaultdict(lambda: []),
        }
    countries = defaultdict(lambda: [])
    countries = {
        'true': defaultdict(lambda: []),
        'pred': defaultdict(lambda: []),
    }

However, when trying to append an element, I am getting this issue:

    print(lang['true'][languages[xs]])
TypeError: string indices must be integers

languages[xs] returns 'en', 'es', etc.

mikeizbicki commented 4 years ago

We can see from the error that the problem is with the indexing itself, not with the dictionaries, so the way you defined langs and countries shouldn't be an issue. I'm guessing that languages is a list and xs is also a list; you need something like languages[0] or languages[i] where i is an integer.

Stefanos-stk commented 4 years ago
                    for xs in range(args.batch_size):
                        top_n, top_i = output_class.topk(1)
                        guess_i = top_i[xs].item()
                        category_i = category_tensor[xs]
                        guess = all_categories[guess_i]
                        category = all_categories[category_i]
                        y_true.append(category)
                        y_pred.append(guess)
                        print(languages[xs])
                        print(lang['true'][languages[xs]])
                        lang['true'][languages[xs]].append(category)
                        lang['pred'][languages[xs]].append(guess)

                        countries['true'][country_codes[xs]].append(category)
                        countries['pred'][country_codes[xs]].append(guess)

This is how I append to the dictionaries, where xs is an int. I think the way I append is correct, right?

Stefanos-stk commented 4 years ago

nvidia-smi is still not working; I am getting this error:

ssaa2018@lambda-server:~$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
mikeizbicki commented 4 years ago

Oooff. It should be fixed again. Hopefully for good.


Stefanos-stk commented 4 years ago

I fixed the other issue with the indices: I was accidentally referring to a temp variable called lang; the actual dictionary to append to is called langs (I have to find better names for variables).

Stefanos-stk commented 4 years ago

All 8 GPUs are running at the moment: 6 training, 2 validating.

Stefanos-stk commented 4 years ago

I have fixed the sliding window algorithm; it now works just fine. I am trying to create the figure with 2 columns (one with the top 10 languages, using the sliding window image, and one with the predicted emojis). However, I am running into some issues. I am using "I feel so tired and bored #quarantine" as the sentence, but when I try to get the result I am getting this error:

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.76 GiB total capacity; 8.24 GiB already allocated; 2.56 MiB free; 9.92 GiB reserved in total by PyTorch)

When I was working on fixing the sliding window algorithm I could get it to print out the images; for some reason now it only manages to print the char one, not the line one. I am guessing the model I am using is pretty big (8 GB) and there is no space left for the image. However, I noticed that for some reason it reserves 10 GB for PyTorch, which leaves at least 1 GB of memory unused; is there a way I can limit PyTorch's reserved space? (attached: line0000, char)

Also, since I am not going to be here next week I was planning to let the models we talked about run while I am away. Is it okay if I read the paper for Monday and post my thoughts about it? And is it okay to read Nate's draft paper next Saturday and fill out the feedback sheet by Monday?

For now I am just going to make the graph for the "char" results of the sliding window algorithm. I am using the 11th most used language since the 2nd is Undefined. Also, the sliding window can't handle Arabic.

I have uploaded my current results in a txt file attached here (I am not sure if the emojis can be depicted): lemojis.txt

mikeizbicki commented 4 years ago

When I was working on fixing the sliding window algorithm I could get it to print out the images; for some reason now it only manages to print the char one, not the line one. I am guessing the model I am using is pretty big (8 GB) and there is no space left for the image.

Images don't take up very much space (max 10 MB, not 1 GB), and they're not stored on the GPU, so this is not the problem. My guess is that you are creating one large "batch" of modified data and storing all of it on the GPU. To fix it:

  1. Break it up into multiple batches, ensuring that memory gets freed between batches.
  2. Wrap everything in with torch.no_grad() to ensure that gradients are not computed for each data point, since these double the amount of memory needed. (A rough sketch combining both is below.)
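A rough sketch of both suggestions combined (hypothetical helper name; it assumes model, tokenizer, and the tuple-style forward from earlier in this thread, plus a transformers version with a callable tokenizer):

    import torch

    def predict_in_batches(lines, batch_size=8):
        outputs = []
        with torch.no_grad():                      # no gradients -> roughly half the memory
            for i in range(0, len(lines), batch_size):
                batch = lines[i:i + batch_size]
                encoded = tokenizer(batch, return_tensors='pt', padding=True, truncation=True)
                out, _ = model((encoded['input_ids'].cuda(), encoded['attention_mask'].cuda()))
                outputs.append(out.cpu())          # move results off the GPU so its memory is freed
        return torch.cat(outputs)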

Also, since I am not going to be here next week I was planning to let the models we talked about run while I am away. Is it okay if I read the paper for Monday and post my thoughts about it? And is it okay to read Nate's draft paper next Saturday and fill out the feedback sheet by Monday?

Yes.

Also the sliding window can't handle Arabic.

It should be able to do Arabic just fine, but the way it's visualized is a bit different.

Stefanos-stk commented 4 years ago

It seems that some languages behave somewhat differently with the negative emotions I tried to convey with this specific phrase; I would like to believe that our results are good? Most of the languages capture the right emotion except Arabic and Hindi, which each include only one emoticon that applies to the phrase. I also noticed that some emojis with small representation (0.0052) were able to make it into the top 5 lists. Table-wise I tried to squeeze everything in as tightly as possible, and I decided to include the line dividers for easier comprehension. tableemoji.pdf

Stefanos-stk commented 4 years ago

So I have started 5 experiments; 3 are already running:

CUDA_VISIBLE_DEVICES=3 nohup python3 -u names_transformers.py --data data/tweet_emoji_dataset_fixed_train.jsonl.gz --model bert --gradient_clipping --batch_size 64 --learning_rate 1e-3 --optimizer sgd --train --train_where inside > nohup/sgd_clip &  
CUDA_VISIBLE_DEVICES=4 nohup python3 -u names_transformers.py --data data/tweet_emoji_dataset_fixed_train.jsonl.gz --model bert --batch_size 64 --learning_rate 1e-3 --optimizer sgd --train --train_where inside > nohup/sgd_no_clip &  
CUDA_VISIBLE_DEVICES=5 nohup python3 -u names_transformers.py --data data/tweet_emoji_dataset_fixed_train.jsonl.gz --warm_start log/warm_starts/warmed_lr5_adam --model bert --batch_size 64 --learning_rate 1e-6 --optimizer adam --train --gradient_clipping --train_where inside > nohup/adam_warm_clip_lr6 &  
CUDA_VISIBLE_DEVICES=6 nohup python3 -u names_transformers.py --data data/tweet_emoji_dataset_fixed_train.jsonl.gz --warm_start log/warm_starts/warmed_lr5_adam --model bert --batch_size 64 --learning_rate 1e-6 --optimizer adam --train --train_where inside > nohup/adam_warm_no_clip &  
CUDA_VISIBLE_DEVICES=7 nohup python3 -u names_transformers.py --data data/tweet_emoji_dataset_fixed_train.jsonl.gz --warm_start log/warm_starts/warmed_lr5_adam --model bert --batch_size 64 --learning_rate 1e-7 --optimizer adam --train --gradient_clipping --train_where inside > nohup/adam_warm_clip_lr7 &  

In short: testing sgd with and without clipping; adam warm started with and without clipping; and adam warm started with learning rates 1e-6 and 1e-7, both with clipping.

Stefanos-stk commented 4 years ago

Roses are red, violets are blue, the models didn't run, I am sad now too.

I am restarting them; hopefully by Wednesday I'll get some good results. Also a question about the corona dataset: can I start running on the corona dataset by Wednesday, and could you send me some brief instructions on how to access it? Thank you!

Stefanos-stk commented 4 years ago

Thankfully, the experiments have done pretty well overnight and all of them seem to have converged around @1: ~0.30.

Stefanos-stk commented 4 years ago

So it turns out that the formatting of the line messes up Arabic and the other non-Latin characters. I am going to fix the table now with the correct sliding window algorithm, but since I am using that formatting function for every line in the model, does that mean the results are affected by that change? (attached: line0012, char)

mikeizbicki commented 4 years ago

That's what the Arabic is supposed to look like.

There shouldn't be any formatting of the line before you pass it to the model, so I'm not sure what you're referring to. If you are doing some sort of preprocessing of the string before passing it to bert, then this will likely cause a drastic performance reduction for all the non-English languages.

As for running on the corona dataset, we can't do that until we have the model fully trained.

Stefanos-stk commented 4 years ago

I am referring to this function here, which I was using for everything passed into bert. It turns out the unidecode(line) call messes everything up, so that's bad for everything that was running all along. I've fixed it, but I should start running models again, right? That's why we were losing all the accents in French, Greek, and other non-Latin characters.

import re
import demoji
from unidecode import unidecode

def format_line(line):
    line = unidecode(line)                    # <- the culprit: strips accents and non-Latin characters
    line = demoji.replace(line)               # remove emojis (they are the labels)
    line = re.sub(r"(@\S*)", "@", line)       # collapse @mentions
    line = re.sub(r"http\S*", "url", line)    # replace urls
    return line
mikeizbicki commented 4 years ago

Yes, the call to unidecode is messing everything up for non-English languages, and you'll unfortunately have to retrain everything from the beginning. Remember that this training first involves training the last layer only, and then warm starting the full model. On the plus side, you should expect to get significantly better performance.

I would also change

    line = re.sub(r"http\S*", "url",line)

to

    line = re.sub(r"http://\S*", "url",line)

as the first regex will occasionally match against things that are not urls, like "firefox made an http request to google.com". (But that's an extremely minor point that may not even occur in the data.)
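For concreteness, a sketch of what format_line might look like after these changes (illustrative only, not the exact code in the repo):

    import re
    import demoji

    def format_line(line):
        # no unidecode here: accents and non-Latin scripts stay intact for multilingual BERT
        line = demoji.replace(line)                    # strip emojis (they are the labels)
        line = re.sub(r"(@\S*)", "@", line)            # collapse @mentions
        line = re.sub(r"http://\S*", "url", line)      # narrower url pattern suggested above
                                                       # (r"https?://\S*" would also catch https links)
        return line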

I wouldn't delete the old models at this point though unless you need space. There's a slight chance that we can still use them for something.

Stefanos-stk commented 4 years ago
CUDA_VISIBLE_DEVICES=0 nohup python3 -u names_transformers.py --data data/tweet_emoji_dataset_fixed_train.jsonl.gz --model bert --batch_size 64 --learning_rate 1e-4 --log_dir_base log_correct --gradient_clipping --optimizer adam --train --train_where outside > nohup_last/adam_last_1e4 &  
CUDA_VISIBLE_DEVICES=1 nohup python3 -u names_transformers.py --data data/tweet_emoji_dataset_fixed_train.jsonl.gz --model bert --batch_size 64 --learning_rate 1e-5 --log_dir_base log_correct --gradient_clipping --optimizer adam --train --train_where outside > nohup_last/adam_last_1e5 &
CUDA_VISIBLE_DEVICES=2 nohup python3 -u names_transformers.py --data data/tweet_emoji_dataset_fixed_train.jsonl.gz --model bert --batch_size 64 --learning_rate 1e-6 --log_dir_base log_correct --gradient_clipping --optimizer adam --train --train_where outside > nohup_last/adam_last_1e6 & 

CUDA_VISIBLE_DEVICES=3 nohup python3 -u names_transformers.py --data data/tweet_emoji_dataset_fixed_train.jsonl.gz --model bert --batch_size 64 --learning_rate 1e-5 --log_dir_base log_correct --optimizer adam --train --train_where outside > nohup_last/adam_last_1e5_noclip &

CUDA_VISIBLE_DEVICES=4 nohup python3 -u names_transformers.py --data data/tweet_emoji_dataset_fixed_train.jsonl.gz --model bert --batch_size 64 --learning_rate 1e-2 --log_dir_base log_correct --gradient_clipping --optimizer sgd --train --train_where outside > nohup_last/sgd_last_1e2 &
CUDA_VISIBLE_DEVICES=5 nohup python3 -u names_transformers.py --data data/tweet_emoji_dataset_fixed_train.jsonl.gz --model bert --batch_size 64 --learning_rate 1e-3 --log_dir_base log_correct --gradient_clipping --optimizer sgd --train --train_where outside > nohup_last/sgd_last_1e3 & 

CUDA_VISIBLE_DEVICES=6 nohup python3 -u names_transformers.py --data data/tweet_emoji_dataset_fixed_train.jsonl.gz --model bert --batch_size 64 --learning_rate 1e-2 --log_dir_base log_correct --optimizer sgd --train --train_where outside > nohup_last/sgd_last_1e3_noclip & 

In short:
- Cuda 0,1,2: adam, clip, lr=1e-4/1e-5/1e-6, last layer
- Cuda 3: adam, no clip, lr=1e-5, last layer
- Cuda 4,5: sgd, clip, lr=1e-2/1e-3, last layer
- Cuda 6: sgd, no clip, lr=1e-3, last layer
- Cuda 7: nothing; use it for validation

These are the experiments that I am running. Next, I will pick the best-performing ones and use them as warm starts to train all the layers. Validation is ready, exporting the results to csv is ready, and the infer script is fixed and ready to replace the table I already created. So it should only be a matter of time. In the meantime, I am reading about writing a paper and the notes you gave us. Sorry about this mess-up; I should have been more careful with the initial tests.

Question about the experiments: should I eventually run validation on all the experiments above, so that a table includes F1 scores for training with different hyperparameters and for training all the layers vs. just the last one? If so, should the table be like Table 1 in Nate's draft?

Stefanos-stk commented 4 years ago

After 160k steps of batch size 64, with smoothing 0.99, the @1 accuracies are:

- adam_lr=1e-6_clip: 0.2539
- adam_lr=1e-5_clip: 0.2651
- adam_lr=1e-4_clip: 0.2726
- adam_lr=1e-5_no_clip: 0.2715
- sgd_lr=1e-2_clip: 0.2604
- sgd_lr=1e-3_clip: 0.271
- sgd_lr=1e-3_no_clip: 0.1206

In other words, they are already doing much better.

Stefanos-stk commented 4 years ago

Run no. 2: I stopped the 4 experiments you mentioned and added the following 3:

CUDA_VISIBLE_DEVICES=1 nohup python3 -u names_transformers.py --data data/tweet_emoji_dataset_fixed_train.jsonl.gz --model bert --batch_size 64  --learning_rate 1e-5 --log_dir_base log_correct --optimizer adam --train --train_where outside --warm_start log_correct/warm_starts/model946196 > nohup_last/adam_last_1e5_noclip_warm_started &
CUDA_VISIBLE_DEVICES=2 nohup python3 -u names_transformers.py --data data/tweet_emoji_dataset_fixed_train.jsonl.gz --model bert --batch_size 64  --learning_rate 1e-5 --log_dir_base log_correct --optimizer adam --train  --gradient_clipping --train_where outside --warm_start log_correct/warm_starts/model946196 > nohup_last/adam_last_1e5_clip_warm_started &

CUDA_VISIBLE_DEVICES=4 nohup python3 -u names_transformers.py --data data/tweet_emoji_dataset_fixed_train.jsonl.gz --model bert --batch_size 64  --learning_rate 1e-5 --log_dir_base log_correct --optimizer adam --train --gradient_clipping --train_where inside --warm_start log_correct/warm_starts/model946196 > nohup_last/adam_all_1e5_clip_warm_started &

In short:
- Cuda 1: lr 1e-05, adam, no clip, warm started from the best-performing model (which had lr = 1e-04)
- Cuda 2: same, with clip
- Cuda 4: same, with clip, training the entire model

Stefanos-stk commented 4 years ago

Added 2 additional runs; waiting to see how they go so I can stop other runs and move on to improving models that train the entire model:


CUDA_VISIBLE_DEVICES=6 nohup python3 -u names_transformers.py --data data/tweet_emoji_dataset_fixed_train.jsonl.gz --model bert --batch_size 64  --learning_rate 1e-6 --log_dir_base log_correct --optimizer adam --train --gradient_clipping --train_where inside --warm_start log_correct/warm_starts/model849946 > nohup_last/adam_all_1e6_clip_warm_started &
CUDA_VISIBLE_DEVICES=7 nohup python3 -u names_transformers.py --data data/tweet_emoji_dataset_fixed_train.jsonl.gz --model bert --batch_size 64  --learning_rate 1e-6 --log_dir_base log_correct --optimizer adam --train --gradient_clipping --train_where outside --warm_start log_correct/warm_starts/model566123 > nohup_last/adam_last_1e6_clip_warm_started &

In short:
- Cuda 6: lr 1e-6, adam, all layers, warm started
- Cuda 7: lr 1e-6, adam, last layer, warm started (trying to achieve max accuracy with only the last layer)

Stefanos-stk commented 4 years ago

Running the last experiments:

CUDA_VISIBLE_DEVICES=0 nohup python3 -u names_transformers.py --data data/tweet_emoji_dataset_fixed_train.jsonl.gz --model bert --batch_size 64  --learning_rate 1e-7 --log_dir_base log_correct --optimizer adam --train --gradient_clipping --train_where inside --warm_start log_correct/warm_starts/model792907 > nohup_last/adam_all_1e7_clip_warm_started &

CUDA_VISIBLE_DEVICES=1 nohup python3 -u names_transformers.py --data data/tweet_emoji_dataset_fixed_train.jsonl.gz --model bert --batch_size 64  --learning_rate 3e-7 --log_dir_base log_correct --optimizer adam --train --gradient_clipping --train_where inside --warm_start log_correct/warm_starts/model792907 > nohup_last/adam_all_3e7_clip_warm_started &

CUDA_VISIBLE_DEVICES=2 nohup python3 -u names_transformers.py --data data/tweet_emoji_dataset_fixed_train.jsonl.gz --model bert --batch_size 64  --learning_rate 5e-7 --log_dir_base log_correct --optimizer adam --train --gradient_clipping --train_where inside --warm_start log_correct/warm_starts/model792907 > nohup_last/adam_all_5e7_clip_warm_started &

CUDA_VISIBLE_DEVICES=3 nohup python3 -u names_transformers.py --data data/tweet_emoji_dataset_fixed_train.jsonl.gz --model bert --batch_size 64  --learning_rate 1e-6 --log_dir_base log_correct --optimizer adam --train --gradient_clipping --train_where inside --warm_start log_correct/warm_starts/model849946_2 > nohup_last/adam_all_1e6_clip_warm_started &

CUDA_VISIBLE_DEVICES=4 nohup python3 -u names_transformers.py --data data/tweet_emoji_dataset_fixed_train.jsonl.gz --model bert --batch_size 64  --learning_rate 3e-6 --log_dir_base log_correct --optimizer adam --train --gradient_clipping --train_where inside --warm_start log_correct/warm_starts/model849946_2 > nohup_last/adam_all_3e6_clip_warm_started &

CUDA_VISIBLE_DEVICES=5 nohup python3 -u names_transformers.py --data data/tweet_emoji_dataset_fixed_train.jsonl.gz --model bert --batch_size 64  --learning_rate 5e-6 --log_dir_base log_correct --optimizer adam --train --gradient_clipping --train_where inside --warm_start log_correct/warm_starts/model849946_2 > nohup_last/adam_all_5e6_clip_warm_started &

Basically just warm starting from previous models with different learning rates

Stefanos-stk commented 4 years ago

A quick clarification about the main function of the PyPI package: as I understand it, it should be a function that takes in either a list of strings or a single string (converting the string into a list of strings), plus the number of guesses for the model with a default value of 80 (the number of categories), and it returns a list of dictionaries (each dictionary has the emojis as keys and the percentages as values). Each entry of the list corresponds to the respective input sentence.
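Something like the following sketch is what I have in mind (names and internals illustrative; it reuses model, tokenizer, and all_categories from the training script and assumes the tuple-style forward defined earlier):

    import torch

    def infer_list(sentences, num_guesses=80):
        if isinstance(sentences, str):
            sentences = [sentences]                    # accept a single string too
        results = []
        with torch.no_grad():
            encoded = tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)
            out, _ = model((encoded['input_ids'], encoded['attention_mask']))
            probs = torch.softmax(out, dim=1)
            for row in probs:
                top_p, top_i = row.topk(num_guesses)
                results.append({all_categories[i]: f'{p:.4f}'
                                for p, i in zip(top_p.tolist(), top_i.tolist())})
        return results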

mikeizbicki commented 4 years ago

Yes, that's correct.

Stefanos-stk commented 4 years ago

My other question has to do with how I can collect that data from the corona_dataset while using the model. I was thinking that the easiest way to gather the information would be the following. The data would look like this:

day 1: emoji2, emoji4, emoji3, emoji9 ...
day 2: emoji1, emoji3, emoji9, emoji2 ...
...

where each row lists the emojis from most to least frequently used. That being said, it seems it would be easier (for me at least) to create a different function (basically the one I already had) that calculates those frequencies for each day. Is that okay for this task?

mikeizbicki commented 4 years ago

My recommendation is to break it down into smaller steps.

Step 1: calculate and save the emoji/percentages for each tweet as a json file

Step 2: take the json file as input and output summaries per day per emoji. The best way to summarize is to add up the probabilities for each emoji, rather than trying to "discretize" the emoji into a 1 or a 0. Output as a csv (rough sketch below).

Step 3: convert the csv into an image

The key thing about the function you already had is that it was doing too much work for a single function. You should have separate functions to do each of these steps.
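A rough sketch of step 2 (field names are hypothetical; it assumes step 1 wrote one json object per tweet with a 'day' field and an 'emojis' dict mapping emoji to probability):

    import csv
    import json
    from collections import defaultdict

    totals = defaultdict(lambda: defaultdict(float))     # day -> emoji -> summed probability
    with open('tweet_scores.jsonl') as f:
        for line in f:
            tweet = json.loads(line)
            for emoji, prob in tweet['emojis'].items():
                totals[tweet['day']][emoji] += float(prob)

    emojis = sorted({e for day in totals.values() for e in day})
    with open('emoji_per_day.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['day'] + emojis)
        for day in sorted(totals):
            writer.writerow([day] + [totals[day].get(e, 0.0) for e in emojis])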

Stefanos-stk commented 4 years ago

Sorry for not making it to the call. I will post the update here:

The test runs have all completed; here are the results:

- model849946: training the entire model, LR 1e-05, all 80 emojis:

'weighted avg': {'precision': 0.21142558463277342, 'recall': 0.30682289601597723, 'f1-score': 0.20972978750804427, 'support': 8716416}}

- model849946: training the entire model, LR 1e-05, top 15 emojis:

Stefanos-stk commented 4 years ago

I fixed the function for the PyPI package; it now returns a list of dictionaries. Example input:

x = infer_list(['je suis fatigue','Σημερά βαριέμαι πάλυ πολύ ','i love long walks on the beach and eating ravioli'],3)

output:

[{'😭': '0.3368', '😂': '0.1445', '😔': '0.0592'}, {'😂': '0.5090', '😍': '0.0428', '😜': '0.0340'}, {'😂': '0.1649', '😭': '0.1467', '😍': '0.1282'}]
Stefanos-stk commented 4 years ago

For tomorrow:

No questions as of right now

Stefanos-stk commented 4 years ago

emoji_prediction_table_fix1.pdf

I fixed the l2 issue and the table looks much cleaner, showing only the important words. I haven't fixed the Japanese tokenizer yet, but that can be done later; I might need some minor help with that one.

Stefanos-stk commented 4 years ago

(attached: final_graph) So I spent a lot of time making this graph: I separated the 80 emojis and created my own interpretation of it. I am a bit behind on the paper schedule, but I think it was worth the time spent on this graph. The number of emojis in each category is somewhat balanced. I am still trying to get the paper into draft form by today; if I don't make it, it will be ready 100% tomorrow (sorry for the delay).

mikeizbicki commented 4 years ago

That's a really slick looking figure!

Stefanos-stk commented 4 years ago

Thank you! This is how I divided the emoji dataset, and I've attached the previous graph with the mask emoji percentage. I don't think I can have the paper ready today... I am done with all the graphs though, so the path to finishing the draft is much clearer. I will upload the draft by tomorrow. So sorry for the delay; I misjudged the time I needed for the graphs. (attached: final_graph)

PLUTCHIKWHEEL.pdf

Stefanos-stk commented 4 years ago

I have figured out a way to create the confusion matrix, but I have no clue how to use emojis as labels in matplotlib. I have found some solutions, but it seems you need an iOS system to implement them. Do you know any fonts that support emojis in matplotlib?

I have to install this first: https://pycairo.readthedocs.io/en/latest/getting_started.html, but it requires some sudo stuff which I am not familiar with.

mikeizbicki commented 4 years ago

You should be able to use any font with emoji support that you'd like. You just have to load it the same way you load the Japanese fonts.

I've never actually done that before though.
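For what it's worth, a minimal sketch of how loading a custom font for tick labels usually looks (the path and font file are assumptions; any .ttf that contains emoji glyphs would do):

    import matplotlib.pyplot as plt
    from matplotlib.font_manager import FontProperties

    emoji_font = FontProperties(fname='fonts/NotoEmoji-Regular.ttf')   # hypothetical path

    fig, ax = plt.subplots()
    ax.imshow([[1, 0], [0, 1]])                      # stand-in for the confusion matrix
    ax.set_xticks([0, 1])
    ax.set_xticklabels(['😂', '😷'], fontproperties=emoji_font)
    ax.set_yticks([0, 1])
    ax.set_yticklabels(['😂', '😷'], fontproperties=emoji_font)
    fig.savefig('confusion_demo.pdf')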


Stefanos-stk commented 4 years ago

These are the 2 confusion matrices I have generated. I am going to use a different coloring pattern so we can see the differences in the lower percentages (10%-30%, etc.) more clearly. I could only generate some of the emoji labels; I am going to use a tabular environment in LaTeX and add the labels that way. all80_confusion.pdf top15_confusion.pdf

Stefanos-stk commented 4 years ago

true_pred_rotated.pdf F1_LANG_EMOJIS.pdf Hello Mike, these are 2 of the 3 graphs I have to complete. I am still trying to figure out the confusion matrix, but that should be done by the end of today. My LaTeX setup is, for some reason, majorly broken: I can't compile and look at the pdf. I have tried numerous compile combinations with pdflatex, luatex, etc., and re-installed my LaTeX distribution, but nothing seems to work. I can still add text to the .tex file and push it to GitHub, but I cannot see the results or any errors. I think at this moment it is not worth trying to fix it (I spent 3 hrs on it yesterday); I'll try using Overleaf. The graphs above are the true vs. predicted stacked bar graphs and the combined results table.

Regarding the mask emoji and the union we were looking for: I have gathered the real tweets that have the mask emoji, and it seems that both pred and true have the same percentage (look at the 2nd graph attached). I am also going to collect the tweets our model predicted to have a face mask, but the differences seem to be very subtle (I think the model did pretty well with the mask emoji).

Stefanos-stk commented 3 years ago

So I have found some interesting tweets that might be good for the Google Translate table: (id, text)

1219758712123822080, Yo guys 139 confirmed cases na ang Coronavirus sa Pinas. Please always wash your hands and wear a face mask.
1219759827875840001, "Washington Man Is 1st in US to Catch Newly Discovered Viral Pneumonia. Get out your face masks folks! #coronavirus
1219970417907183619, Flying for the weekend to #Germany should I be worried about #coronavirus?

I will probably modify them a bit to get good results across the languages

mikeizbicki commented 3 years ago

Those look great!


Stefanos-stk commented 3 years ago

Hello Mike, so I added all the updated graphs/tables and the corrections to the main tex file. I uploaded the tex file that creates the F1 result table, but I wasn't able to do the corona dataset language list; I didn't find the data, I only had the stats for the twitter emoticon dataset. If you send me that, I can do it tomorrow morning!