PolinaZulik / metaphor-psycho


Test MelBERT metaphor identification on Russian corpora #10

Open PolinaZulik opened 2 years ago

PolinaZulik commented 2 years ago

Run metaphor detection with MelBERT on 3 corpora: LCC, Yulia's verbs, Wiktionary verbs.

  1. First, reproduce the results obtained in the paper "MelBERT: Metaphor Detection via Contextualized Late Interaction using Metaphorical Identification Theories" (arXiv) on an English dataset, since no Russian data is used in the paper.
  2. Find the best model for Russian: the authors suggest RoBERTa; look for the closest Russian equivalent here.
  3. Perform train/dev/test experiments using the same folds as I used in #9. The corpora are already divided: use the train/dev/test folders or the column indicating the fold. If you use early stopping, please use patience = 1 for dev-loss increase (see the sketch after this list). 3 experiments; check out the results table.
  4. Perform cross-corpora experiments: train and validate on one corpus, test on the training set of another corpus. 6 experiments.
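A minimal sketch of the early-stopping rule in item 3 (patience = 1 on dev loss; all names are illustrative, none come from the MelBERT code):

```python
def train_with_early_stopping(train_epoch, eval_dev, max_epochs=10, patience=1):
    """Stop training as soon as dev loss has increased `patience` times.

    `train_epoch` and `eval_dev` stand in for the real training and
    evaluation routines (illustrative names, not from the MelBERT code).
    """
    best_dev_loss = float("inf")
    bad_epochs = 0
    for epoch in range(max_epochs):
        train_epoch()
        dev_loss = eval_dev()
        if dev_loss < best_dev_loss:
            best_dev_loss, bad_epochs = dev_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # dev loss increased once: stop
    return best_dev_loss
```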
PolinaZulik commented 2 years ago

@Wheatley961 if you struggle to find English corpora to reproduce the results, I've come across this paper, where they list all available corpora with download links on p. 6.

Wheatley961 commented 2 years ago

@PolinaZulik , there are no problems finding the original corpora. :-) Here are some results, but I do need your help!

First of all, the MelBERT algorithm required changing the structure of the input datasets: the first column is indices, the second is classes (0 or 1), the third is texts (sentences), the fourth is POS, and the last one is word indices. So I wrote code that saves all your datasets in the new format in new folders; they are the folders with the _formatted suffix (a schematic sketch of the reformatting is at the end of this comment). Then I adapted MelBERT for both the original datasets and ours (with some comments for you, to simplify its usage). Our datasets are adapted with the VUA algorithm. The problems are as follows:

1) The original datasets are time- and RAM-consuming; sometimes I need to wait 15 or more hours to obtain the results reported in the paper (just for one dataset). Some samples are provided when we install this GitHub repo in our coding environment; they are smaller, but the F-measure and all the other metrics are different (we don't need that). When I train and test on the full datasets, RAM runs out, there is a crash, and I have to start the experiments from the very beginning. Could you run the experiments on your side using these codes in Colab (I hope you have more RAM in your account)?

2) I use rubert-base-cased-conversational for our datasets. LCC and Yulia's files are fine, but during training I face a RAM crash in Colab again. I guess it can easily be eliminated by using another coding environment. The tricky part is the Wiktionary verbs. While training, I got the following error:

```
/content/MelBERT/run_classifier_dataset_utils.py in convert_examples_to_two_features(examples, label_list, max_seq_length, tokenizer, output_mode, args)
    509 comma2 = tokenizer.tokenize(" ,")[0]
    510 for i, w in enumerate(tokens):
--> 511     if i < tokens_b + 1 and (w in [comma1, comma2]):
    512         local_start = i
    513     if i > tokens_b + 1 and (w in [comma1, comma2]):

TypeError: can only concatenate list (not "int") to list
```

The problem is with the Wiktionary train file. When I change it to Yulia's train file, for instance, it works fine (but RAM again!). I have been reflecting on the problem for several days and digging through the code and the train files, but I've lost all hope of getting rid of it. If you have some free time to give some advice, that would be perfect.
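As promised, a schematic sketch of the reformatting (file paths and source column names below are placeholders, not the exact script):

```python
import pandas as pd

# Schematic reformatting into the 5-column layout MelBERT expects:
# index, label (0/1), sentence, POS tag, target-word index.
df = pd.read_csv("corpus/train.tsv", sep="\t")  # placeholder path
out = pd.DataFrame({
    "index": range(len(df)),
    "label": df["label"],          # 0 = literal, 1 = metaphor
    "sentence": df["text"],
    "POS": df["pos"],
    "w_index": df["word_index"],   # position of the target word
})
out.to_csv("corpus_formatted/train.tsv", sep="\t", index=False)
```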

PolinaZulik commented 2 years ago
> The original datasets are time- and RAM-consuming [...] Could you run the experiments on your side using these codes in Colab (I hope you have more RAM in your account)?

Please use a GPU (graphics processing unit!) on Colab; it also comes with more RAM. It's almost impossible to use BERT-based models without a GPU: you'd have to set the batch size to 1 and it would take ages. I'll add instructions on enabling the GPU in Colab below.

> I use rubert-base-cased-conversational for our datasets. LCC and Yulia's files are fine, but during training I face a RAM crash in Colab again. [...] The tricky part is the Wiktionary verbs. While training, I got the following error:

I'll have a look, but the problem is surely in the format; it might be slightly different between Wiktionary and Yulia's. Namely, in your code the problem is that tokens_b is a list, not an integer! I think you just have to pre-format tokens_b before line 513, while reading the dataset. I'll take a deeper look at the code.
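Something along these lines, as a guess at the fix (untested against the actual file):

```python
# Guess at the fix: tokens_b arrives here as a list (e.g. ["7"]) rather than
# an int, so `tokens_b + 1` raises the TypeError. Coerce it before the loop:
if isinstance(tokens_b, list):
    tokens_b = int(tokens_b[0])
else:
    tokens_b = int(tokens_b)
```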

PolinaZulik commented 2 years ago

Add a GPU in Colab:

(two screenshots showing the steps in the Colab runtime settings)
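After switching the runtime, you can check that PyTorch actually sees the GPU:

```python
import torch

# Should print True and the name of the Colab GPU (e.g. a Tesla T4)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```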

PolinaZulik commented 2 years ago
> I need to wait 15 or more hours to obtain the results reported in the paper (just for one dataset).

BTW, you don't have to check all the English datasets; only do that if you're interested and have the time. The goal of reproducing is to make sure we run the original model correctly and get the same results. 1 dataset would be enough, I think; 2 is more than enough.

PolinaZulik commented 2 years ago

Re problem 2:

```
/content/MelBERT/run_classifier_dataset_utils.py in convert_examples_to_two_features(examples, label_list, max_seq_length, tokenizer, output_mode, args)
    509 comma2 = tokenizer.tokenize(" ,")[0]
    510 for i, w in enumerate(tokens):
--> 511     if i < tokens_b + 1 and (w in [comma1, comma2]):
    512         local_start = i
    513     if i > tokens_b + 1 and (w in [comma1, comma2]):

TypeError: can only concatenate list (not "int") to list
```

Unfortunately, I can't run the code right now, because I don't have the folder structure with the MelBERT code on Colab, but I can see some important points. Try adding a debug print before the failing line:

```python
if type(tokens_b) != int:
    print(tokens_b)
```

so that we can see which sentence causes the problem.


My bet is that (1) in wikt_formatted you're using the original sentence column, with no spaces between words and punctuation; (2) the MelBERT dataset loader in run_classifier_dataset_utils.py uses spaces to split words (see the screenshot below); as a result, in some cases you get wordforms glued together with punctuation. Moreover, this also confuses the word indices, and that might bring much more trouble: the word indices ('spans') in the original file only work for the space-split data in the text column.
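A tiny illustration of why the glued punctuation also shifts the indices (made-up sentence):

```python
raw = "Он потерял голову."      # original 'sentence' column
split = "Он потерял голову ."   # 'splitted' column, punctuation separated

print(raw.split())    # ['Он', 'потерял', 'голову.']: punctuation glued on
print(split.split())  # ['Он', 'потерял', 'голову', '.']: indices match spans
```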

Let's see what happens after you re-format the wikt dataset with the 'splitted' text column.

(screenshot of the dataset loader code)

PolinaZulik commented 2 years ago

PS: this single line needs changing: `sent.append(str(row[1]))` -> `sent.append(str(row[5]))`, if I'm not mistaken.


Wheatley961 commented 2 years ago

@PolinaZulik , yep! I changed the line, and it worked; the issue was indeed with the spaces. Thanks! :-) But in the original datasets (like VUA18) there are no spaces.

Now there is a CUDA problem (even if I lower the batch size) with our datasets. There is an option to not use CUDA even when it's available, but then during the first epoch I face the RAM issue again. :-( I'm trying to solve it.

PolinaZulik commented 2 years ago

> Now there is a CUDA problem (even if I lower the batch size)

What's the error output, and what's your batch size?

Wheatley961 commented 2 years ago

@PolinaZulik , the output is as follows:

```
RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 11.17 GiB total capacity; 9.96 GiB already allocated; 9.19 MiB free; 10.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

The batch size is 32. I even used `export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:64`, but it didn't help. :-( It doesn't work even if I make the batch size 24/16...

PolinaZulik commented 2 years ago

> It doesn't work even if I make the batch size 24/16...

Why not? Same error? Try 8/4 anyway! And probably restart the runtime before training a new model.
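Also, one note on the PYTORCH_CUDA_ALLOC_CONF attempt: a shell `export` in a separate Colab cell won't reach the Python kernel (and spaces around the `=` would break it in bash anyway). Setting it from Python before the first CUDA allocation is more reliable:

```python
import os

# Must be set before the first CUDA allocation, so do it before importing torch
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:64"

import torch  # noqa: E402  (imported after setting the env var on purpose)
```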

Wheatley961 commented 2 years ago

@PolinaZulik , false alarm! I seem to have set the right parameters; it is working now. Sorry for bothering you today.

Wheatley961 commented 2 years ago

@PolinaZulik , I have a question! The paper says the authors use the bagging technique for the VUAverb corpus, which I am testing right now. Unfortunately, they don't provide the parameters needed for the experiment (we need to set num_bagging and bagging_index), so I have to choose them manually. All the other parameters were set according to the paper (epochs, batch size, etc.); there were no problems there. I have run several experiments, but still can't reach the results presented in the paper (their F-measure is about 75%; mine is only 62%). The only metric that matches is recall (about 73%). If I don't use the bagging technique, the F-measure is about 65% (but the authors don't provide results for experiments without it). Should I keep tuning the bagging parameters until I get the closest results?
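For context, here is my working assumption about what these two parameters do; this is an assumption on my part, not verified against the MelBERT code:

```python
import random

def bagging_subset(examples, num_bagging=10, bagging_index=9, seed=42):
    """Hypothetical semantics: pick the bagging_index-th of num_bagging
    bootstrap samples of the training data (an assumption, not MelBERT's code)."""
    assert 0 <= bagging_index < num_bagging
    rng = random.Random(seed + bagging_index)
    n = len(examples)
    # one bootstrap sample: n examples drawn with replacement
    return [examples[rng.randrange(n)] for _ in range(n)]
```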

PolinaZulik commented 2 years ago

> Should I keep tuning the bagging parameters until I get the closest results?

No! Let's not waste time on this!

I suggest checking other things instead.

I can look into the code tomorrow; if you can't resolve the discrepancies in the results, please let me know where to look.

PolinaZulik commented 2 years ago

> Unfortunately, they don't provide the parameters needed for the experiment (we need to set num_bagging and bagging_index), so I have to choose them manually.

Oh look, here they seem to provide everything https://github.com/PolinaZulik/MelBERT/blob/main/scripts/run_bagging.sh !

Wheatley961 commented 2 years ago

@PolinaZulik , I am using RoBERTa for the VUAverb corpus, so that's OK! :-) RuBERT will be used for our corpora. Thanks, I completely forgot about this code. I will try again now!

Wheatley961 commented 2 years ago

@PolinaZulik , I have finally made it! :-) You can find the results below.

  1. Here is the original code tested on the VUAverb corpus (bagging technique). The results are as close to those described in the paper as I could get. The settings: roberta-base, 3 epochs, MELBERT, batch size 32, num_bagging = 10, bagging_index = 9. The results: acc = 0.74, precision = 0.54, recall = 0.80, f1 = 0.65.
  2. I used rubert-base-cased-conversational for the Russian corpora.
  3. Experimental settings for our corpora: 3 epochs, MELBERT, batch size 32, num_bagging = 10, bagging_index = 9, no early stopping. Code.

| Corpus | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| Yulia’s verbs | 0.77 | 0.84 | 0.65 | 0.73 |
| LCC | 0.91 | 0.88 | 0.93 | 0.91 |
| Wiktionary | 0.71 | 0.59 | 0.77 | 0.67 |

  4. Cross-corpora experiments.

| Train Corpus | Test Corpus | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- |
| LCC | Wiktionary | 0.63 | 0.52 | 0.51 | 0.51 |
| LCC | Yulia’s verbs | 0.69 | 0.80 | 0.51 | 0.62 |
| Wiktionary | LCC | 0.54 | 0.52 | 0.98 | 0.68 |
| Wiktionary | Yulia’s verbs | 0.77 | 0.81 | 0.69 | 0.75 |
| Yulia’s verbs | LCC | 0.53 | 0.51 | 0.99 | 0.67 |
| Yulia’s verbs | Wiktionary | 0.70 | 0.59 | 0.725 | 0.65 |

I am almost sure that there is something to comment on. Feel free to leave any comments! :-)

PolinaZulik commented 2 years ago

great @Wheatley961 ! I have 2 questions:

  1. It's concerning that we get a much lower result than that reported in the paper, because the model might then be somewhat different, and if the original model gives better results on English corpora, it might work better on Russian too. I'll look into the code once again.
  2. In the paper, the authors suggest using RoBERTa instead of BERT. I think these Russian RoBERTa models (1, 2) are worth trying. Please try them too, or give your reasons why not (for example, the second one seems to be large; will it work in MelBERT, or will it take ages?).
PolinaZulik commented 2 years ago

Looking at the code, I'm not sure I can verify all the parameters and/or arguments. Namely, on p. 6 of the paper there's a lot about hyperparameters, warm-up, etc.:

(screenshot of the hyperparameter details from p. 6 of the paper)

Could there be anything you've missed? Have you used warmup and the dropout ratio? Basically, do you use these config parameters and still get the F1 = 0.65 result?
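For reference, with the HuggingFace transformers library a warmup_linear schedule is normally wired up like this (a generic sketch, not MelBERT's exact training loop; the model and step counts are placeholders):

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(10, 2)  # placeholder model, stands in for MelBERT
steps_per_epoch = 100           # placeholder: len(train_loader)
num_epochs, warmup_epochs = 3, 2

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
# LR rises linearly during the warmup steps, then decays linearly to zero
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_epochs * steps_per_epoch,
    num_training_steps=num_epochs * steps_per_epoch,
)
```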

Wheatley961 commented 2 years ago

@PolinaZulik , yes, all of them were used in the experiments: drop_ratio = 0.2, warmup_epoch = 2, lr_schedule = warmup_linear, etc. Of course, I will run experiments with the Russian RoBERTa models. Right now I can offer only two possible reasons for the low F1. The first is connected with specific linguistic features of Russian metaphors (word order, diverse sentence lengths in our train/test datasets, etc.). The second might be connected with the choice of model. I will try to obtain more results by Friday.

PolinaZulik commented 2 years ago

@Wheatley961 no, my concern is that we can't reproduce the English results!

Wheatley961 commented 2 years ago

@PolinaZulik , I also tried VUA18, but F1 was also about 65%. :-( Should I try one more time and change parameters like warmup_epoch, etc.?

Wheatley961 commented 2 years ago

@PolinaZulik , I am back with some good news! I looked at the code and the given checkpoint one more time and noticed that the random_seed variable was set to 3. I changed it in the code, chose the VUA18 corpus, and it worked! The results are almost the same as described in the paper: our F1-score is 74.75%, while the authors report 79.8% F1 for the same dataset with the same set of parameters. The only difference is that I used only 2 epochs, as the third one constantly crashed while running, even with the GPU in my Colab (the corpus seems to be large). But I suppose the results would be almost the same. Is it better? If so, I can re-run the experiments for our datasets (with 3 epochs). :-)
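For what it's worth, fixing the seed across all the libraries usually looks like this (the standard PyTorch recipe, not MelBERT-specific code):

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 3) -> None:
    """Standard recipe for reproducible PyTorch runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```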

PolinaZulik commented 2 years ago

Yes, that looks better! Although it looks really bad for the MelBERT model that the results vary so much because of the random seed. Yes, please re-run on our corpora.

Wheatley961 commented 2 years ago

@PolinaZulik , OK, I will try to do it today/tomorrow.

Wheatley961 commented 2 years ago

@PolinaZulik , I have re-run the experiments with xlm-roberta-large-en-ru (I had to lower the batch size to 11), but the results didn't show much improvement. Moreover, on the LCC corpus the recall tends to be almost 1; the same tendency appeared in some cases even with RuBERT. Might the reason be a class imbalance in the LCC corpus? Still, here are the results.

| Corpus | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| Yulia’s verbs | 0.79 | 0.86 | 0.69 | 0.77 |
| LCC | 0.50 | 0.50 | 0.99 | 0.67 |
| Wiktionary | 0.69 | 0.57 | 0.74 | 0.64 |

| Train Corpus | Test Corpus | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- |
| LCC | Wiktionary | 0.38 | 0.38 | 0.99 | 0.55 |
| LCC | Yulia’s verbs | 0.50 | 0.50 | 0.99 | 0.67 |
| Wiktionary | LCC | 0.54 | 0.52 | 0.96 | 0.67 |
| Wiktionary | Yulia’s verbs | 0.70 | 0.71 | 0.68 | 0.69 |
| Yulia’s verbs | LCC | 0.56 | 0.53 | 0.98 | 0.69 |
| Yulia’s verbs | Wiktionary | 0.71 | 0.61 | 0.69 | 0.64 |