PolinaZulik opened 2 years ago
@Wheatley961 if you struggle to find English corpora to reproduce the results, I've come across this paper, where they list all available corpora with download links on p. 6.
@PolinaZulik , there are no problems finding the original corpora. :-) Here are some results, but I do need your help! First of all, the MelBERT algorithm required changing the structure of the input datasets: the first column is indices, the second is classes (0 or 1), the third is texts (sentences), the fourth is POS tags, and the last is word indices. That's why I wrote code that saves all your datasets in the new format in new folders (the ones marked _formatted). Then I adapted MelBERT for both the original datasets and ours (with some comments for you, to simplify its usage). Our datasets are adapted with the vua algorithm. The problems are as follows:
1) The original datasets are time- and RAM-consuming; sometimes I need to wait 15 or more hours to obtain the original results presented in the paper (just for one dataset). Some samples are provided when we install this GitHub repo in our coding environment; they are smaller, but the F-measure and all the other parameters come out different (so we can't use them). When I train and test on the full datasets, RAM runs out, Colab crashes, and I have to start the experiments from the very beginning. Could you run the experiments with these codes in Colab on your side (I hope you have more RAM in your account)?
2) I use rubert-base-cased-conversational for our datasets. LCC and Yulia's files are fine, but when training I face a RAM crash in Colab again. I guess it can easily be eliminated by using another coding environment. The tricky part is the Wiktionary verbs: while training, I got the following error:

```
/content/MelBERT/run_classifier_dataset_utils.py in convert_examples_to_two_features(examples, label_list, max_seq_length, tokenizer, output_mode, args)
    509     comma2 = tokenizer.tokenize(" ,")[0]
    510     for i, w in enumerate(tokens):
--> 511         if i < tokens_b + 1 and (w in [comma1, comma2]):
    512             local_start = i
    513         if i > tokens_b + 1 and (w in [comma1, comma2]):

TypeError: can only concatenate list (not "int") to list
```

The problem is with the Wiktionary train file. When I swap in Yulia's train file, for instance, it works fine (but RAM again!). I have been reflecting on the problem for several days and digging through the code and train files, but I have lost all hope of getting rid of it. If you have some free time to give some advice, that would be perfect.
- The original datasets are time- and RAM-consuming; sometimes I need to wait 15 or more hours to obtain the original results presented in the paper (just for one dataset). Some samples are provided when we install this GitHub repo in our coding environment; they are smaller, but the F-measure and all the other parameters come out different (so we can't use them). When I train and test on the full datasets, RAM runs out, Colab crashes, and I have to start the experiments from the very beginning. Could you run the experiments with these codes in Colab on your side (I hope you have more RAM in your account)?
Please use a GPU (graphics processing unit) on Colab! It also comes with more RAM. It's almost impossible to use BERT-based models without a GPU: you'd have to set the batch size to 1, and it would take ages. I'll add instructions on GPU in Colab below.
- I use rubert-base-cased-conversational for our datasets. LCC and Yulia's files are fine, but when training I face a RAM crash in Colab again. I guess it can easily be eliminated by using another coding environment. The tricky part is the Wiktionary verbs: while training, I got the following error:
I'll have a look, but the problem is surely in the format: it might be slightly different between Wiktionary and Yulia's file. Namely, in your code the problem is that `tokens_b` is a list, not an integer! I think you just have to pre-format `tokens_b` while reading the dataset, before line 513 is reached. I'll take a deeper look at the code.
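If it helps, here's a minimal sketch of the kind of guard I mean; the helper name and the fallback rule are my assumptions, not MelBERT code:

```python
def normalize_target_index(tokens_b):
    """Hypothetical helper: coerce tokens_b to an int index so that
    comparisons like `i < tokens_b + 1` cannot raise TypeError."""
    if isinstance(tokens_b, int):
        return tokens_b
    # If it arrived as a list holding a single numeric string (a common
    # symptom of a mis-parsed dataset column), unwrap and convert it.
    if isinstance(tokens_b, list) and len(tokens_b) == 1 and str(tokens_b[0]).isdigit():
        return int(tokens_b[0])
    raise TypeError(f"unexpected tokens_b value: {tokens_b!r}")

print(normalize_target_index(3))      # -> 3
print(normalize_target_index(["7"]))  # -> 7
```

The real fix is of course to read the right column in the first place, but a guard like this would at least tell you which rows are malformed.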
Add GPU in colab:
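(For reference, in case the screenshot doesn't load: to the best of my knowledge, the menu path in Colab is the following.)

```
Runtime → Change runtime type → Hardware accelerator: GPU → Save
```

After switching, restart the runtime so the new accelerator is actually used.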
- I need to wait for 15 or more hours to obtain the original results presented in the paper (just for one dataset).
By the way, you don't have to check all the English datasets; only if you're interested and have the time. The goal of reproducing is to make sure we run the original model correctly and get the same results. One dataset would be enough, I think; two is more than enough.
2.

```
/content/MelBERT/run_classifier_dataset_utils.py in convert_examples_to_two_features(examples, label_list, max_seq_length, tokenizer, output_mode, args)
    509     comma2 = tokenizer.tokenize(" ,")[0]
    510     for i, w in enumerate(tokens):
--> 511         if i < tokens_b + 1 and (w in [comma1, comma2]):
    512             local_start = i
    513         if i > tokens_b + 1 and (w in [comma1, comma2]):

TypeError: can only concatenate list (not "int") to list
```
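The error itself is easy to reproduce in isolation; this tiny snippet just demonstrates why a list-valued `tokens_b` triggers exactly that message:

```python
tokens_b = ["kick", "##ed"]  # a list where an integer index was expected

try:
    tokens_b + 1  # this is what `tokens_b + 1` on line 511 effectively does
except TypeError as e:
    print(e)  # -> can only concatenate list (not "int") to list
```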
Unfortunately, I can't run the code right now, because I don't have the folder structure with the MelBERT code on Colab. But I can see some important points. First, add a debug print just before line 511:

```python
if type(tokens_b) != int:
    print(tokens_b)
```

so that we can see which sentence causes the problem.
My bet is that (1) in wikt_formatted, you're using the original sentence column, with no spaces between words and punctuation; (2) the MelBERT dataset loader in `run_classifier_dataset_utils.py` uses spaces to split words (see pic below); as a result, in some cases you get wordforms glued together with punctuation. Moreover, this also confuses the word indexes, and that might bring much more trouble: the word indexes ('spans') in the original file only work for the space-split data in the text column. Let's see what happens after you re-format the wikt dataset with the 'splitted' text column.
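To illustrate the index shift with a made-up sentence (not from the actual datasets):

```python
raw = "He kicked the bucket."     # original column: punctuation glued to the word
split = "He kicked the bucket ."  # 'splitted' column: space before punctuation

print(raw.split())    # -> ['He', 'kicked', 'the', 'bucket.']
print(split.split())  # -> ['He', 'kicked', 'the', 'bucket', '.']

# With the raw column, token 3 is 'bucket.' (glued) and the final '.' never
# becomes its own token, so any span index that assumes the space-split
# version points at the wrong word.
```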
PS: if I'm not mistaken, this single line needs changing: `sent.append(str(row[1]))` -> `sent.append(str(row[5]))`.
@PolinaZulik , yep! I changed the line, and it worked. The issue was with spaces, indeed. Thanks! :-) But the original datasets (like VUA18) have no such spaces.
Now there is a CUDA problem with our datasets (even if I lower the batch size). I can fall back to not using CUDA even when it is available, but then during the first epoch I face the RAM issue again. :-( I'm trying to solve it.
Now there is a CUDA problem (even if I lower the batch size)
What's the error output, and what's your batch size?
@PolinaZulik , the output is as follows:

```
RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 11.17 GiB total capacity; 9.96 GiB already allocated; 9.19 MiB free; 10.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

The batch size is 32. I even used `export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:64`, but it didn't help. :-( It doesn't work with a batch size of 24 or 16 either...
It doesn't work if I make the batch size equal to 24/16...
Why not? Same error? Try 8/4 anyway! And probably restart the runtime before training a new model.
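Also note that the allocator setting has to be in place before PyTorch initializes CUDA, and the shell export takes no spaces around `=`. A minimal sketch of setting it from Python instead (assuming it runs before the first CUDA allocation):

```python
import os

# Must be set before torch touches CUDA, otherwise it is silently ignored.
# Shell equivalent (note: no spaces around '='):
#   export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:64
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:64"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])  # -> max_split_size_mb:64
```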
@PolinaZulik , false alarm! I seem to set the right parameters. It is working now. Sorry for disturbing you today.
@PolinaZulik , I have a question! The paper says the authors use the bagging technique for the VUAverb corpus, which I am testing right now. Unfortunately, they don't provide the parameters needed for the experiment (we need to set `num_bagging` and `bagging_index`), so I have to choose them manually. All the other parameters were set according to the paper (epochs, batch size, etc.); there were no problems there. I have run several experiments, but I still can't reach the results they present in the paper (their F-measure is about 75%, my results show only 62%). The only metric that matches is recall (about 73%). If I don't use the bagging technique, the F-measure is about 65% (but the authors don't provide any results for experiments without this technique). Should I keep tuning the bagging parameters until I get the closest results?
Should I keep tuning the bagging parameters until I get the closest results?
No! Let's not waste time on this!
I suggest checking other things:
I can look into the code tomorrow, if you don't manage the results discrepancies, plz let me know where to look.
Unfortunately, they don't provide any parameters needed for the experiment (we need to set `num_bagging` and `bagging_index`), so I have to choose them manually.
Oh look, here they seem to provide everything https://github.com/PolinaZulik/MelBERT/blob/main/scripts/run_bagging.sh !
@PolinaZulik , I am using RoBERTa for the VUAverb corpus, so that's ok! :-) RuBERT will be for our corpora. Thanks, I completely forgot about this code. I will try again now!
@PolinaZulik , I have finally made it! :-) You can find the results below.
`num_bagging = 10`, `bagging_index = 9`, no early stopping. The results are as follows: acc = 0.74, precision = 0.54, recall = 0.80, f1 = 0.65. Code.
Run metaphor detection by MelBERT on 3 corpora: LCC, Yulia's verbs, Wiktionary verbs.