allenai / ruletaker


Code to reproduce results from the paper #25

Closed: soumyasanyal closed this issue 3 years ago

soumyasanyal commented 3 years ago

Hi @OyvindTafjord,

Thanks for sharing the data generator and the corresponding datasets used in the paper. Are you also planning to share the experimental setup to reproduce the results from the paper? I've been trying to reproduce the results using the following hyperparameters (following the RoBERTa paper):

model: https://huggingface.co/LIAMF-USP/roberta-large-finetuned-race
optimizer: AdamW
learning rate: 1e-05
adam epsilon: 1e-08
lr_scheduler: linear (with warmup)
warmup fraction: 0.06
weight decay: 0.1
batch size: 32
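
Concretely, my setup looks roughly like this (a sketch of my own training script, not the authors' code; `num_training_steps` is a placeholder and `num_labels=2` is my assumption for the true/false task):

```python
from torch.optim import AdamW
from transformers import (AutoModelForSequenceClassification,
                          get_linear_schedule_with_warmup)

# Start from the RACE-finetuned RoBERTa checkpoint listed above; the
# classification head is re-initialized for binary (true/false) labels.
model = AutoModelForSequenceClassification.from_pretrained(
    "LIAMF-USP/roberta-large-finetuned-race", num_labels=2)

optimizer = AdamW(model.parameters(), lr=1e-5, eps=1e-8, weight_decay=0.1)

# Placeholder: total optimizer steps = (num examples / batch size 32) * 4 epochs.
num_training_steps = 10_000
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.06 * num_training_steps),  # 6% warmup
    num_training_steps=num_training_steps,
)
```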

RoBERTa was fine-tuned for 4 epochs on RACE. However, I find that the models do not converge to the reported numbers within 4 epochs on the RuleTaker datasets (dev accuracy stays at ~60%). Are your results reported after fine-tuning for 4 epochs? It'd be great if you could share your code and hyperparameter details to reproduce the results.

OyvindTafjord commented 3 years ago

Hi, those look close to the parameters we've been using. A reference config I have is identical except for a batch size of 16 and some other AdamW parameters (the default epsilon of 1e-6 and betas = [0.9, 0.98]); I'd be surprised if that makes a difference. Note that the RACE pretraining should not be required: we've gotten similar results starting from plain RoBERTa.
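
For reference, those optimizer settings would look something like this (just a sketch of the equivalent PyTorch call, not the actual AllenNLP config; `model` is a placeholder):

```python
import torch
from torch.optim import AdamW

model = torch.nn.Linear(8, 2)  # placeholder for the RoBERTa classifier

# Reference settings mentioned above: batch size 16, eps 1e-6, betas (0.9, 0.98).
optimizer = AdamW(model.parameters(), lr=1e-5, eps=1e-6,
                  betas=(0.9, 0.98), weight_decay=0.1)
batch_size = 16
```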

There is an AllenNLP model implementation for the RuleTaker dataset available at https://github.com/alontalmor/LeapOfThought/tree/master/LeapOfThought/allennlp_models. If need be, I can send you an exact AllenNLP training configuration for this model.

In general we found training on this dataset to be quite stable, so we didn't do much hyperparameter tweaking, although we do occasionally see a run that fails to train, as you got here.

soumyasanyal commented 3 years ago

Hi, thanks for sharing the pointers, I'll take a look. In a further experiment, I let the model train without an epoch limit, and it seems to start converging at around ~30 epochs. So I was curious whether you got all the reported results within 4 epochs of fine-tuning, or whether they can require longer training (as above)?

Also, thanks for the tip on not requiring RACE fine-tuning, I'll give that a shot as well!

OyvindTafjord commented 3 years ago

My experience was that not many epochs were needed, so that's a bit strange, unless the run gets stuck somehow and then gets unstuck at some later epoch.

soumyasanyal commented 3 years ago

Hi, I've been looking at the model code in LeapOfThought and made my architecture similar to the one mentioned there. But my models still don't converge (~80% accuracy after 4 epochs on the D<=1 dataset). I think I might be missing something subtle in the data processing or some other step. While I continue to look through the repo, it'd really help if you could share the RuleTaker config you mentioned earlier, so that I can debug my code and understand the differences. Thanks!

OyvindTafjord commented 3 years ago

Here's an AllenNLP training config for the depth-3 dataset, without RACE pretraining, with which I got a validation accuracy of 99.4 (validation accuracy after each epoch: 94.4, 98.2, 99.2, 99.4). The main advantage of the RACE pretraining seems to be lowering the chance that training gets stuck at ~50% accuracy (which I've seen more often without it). It's strange that you get stuck at 80% on the D<=1 dataset though; I'm used to seeing runs either stuck at ~50% or reaching very high accuracy.

FWIW, here's a sample tokenization for an input, in case it helps debugging:

qa_tokens = [<s>, If, ĠFiona, Ġis, Ġkind, Ġand, ĠFiona, Ġis, Ġnot, Ġnice, Ġthen, ĠFiona, Ġis, Ġyoung, ., ĠIf, Ġsomething, Ġis, Ġred, Ġthen, Ġit, Ġis, Ġnot, Ġnice, ., ĠErin, Ġis, Ġwhite, ., ĠFiona, Ġis, Ġnot, Ġred, ., ĠDave, Ġis, Ġnot, Ġwhite, ., ĠIf, Ġsomething, Ġis, Ġnice, Ġand, Ġnot, Ġyoung, Ġthen, Ġit, Ġis, Ġquiet, ., ĠIf, Ġsomething, Ġis, Ġnot, Ġyoung, Ġthen, Ġit, Ġis, Ġquiet, ., ĠIf, Ġsomething, Ġis, Ġnot, Ġsmart, Ġthen, Ġit, Ġis, Ġquiet, ., ĠDave, Ġis, Ġquiet, ., ĠIf, Ġsomething, Ġis, Ġyoung, Ġand, Ġquiet, Ġthen, Ġit, Ġis, Ġred, ., ĠDave, Ġis, Ġsmart, ., ĠIf, Ġsomething, Ġis, Ġkind, Ġand, Ġnot, Ġquiet, Ġthen, Ġit, Ġis, Ġred, ., ĠFiona, Ġis, Ġsmart, ., ĠIf, ĠDave, Ġis, Ġnice, Ġthen, ĠDave, Ġis, Ġsmart, ., ĠBob, Ġis, Ġyoung, ., </s>, </s>, Dave, Ġis, Ġsmart, ., </s>]
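
If it helps for debugging, this is roughly how such a tokenization can be inspected with the Hugging Face tokenizer (a sketch; the context string below is a truncated excerpt and the variable names are just placeholders):

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")

# Truncated excerpt of the context, plus the question, from the sample above.
context = ("If Fiona is kind and Fiona is not nice then Fiona is young. "
           "If something is red then it is not nice. Erin is white. "
           "Fiona is not red. Dave is not white. Bob is young.")
question = "Dave is smart."

encoding = tokenizer(context, question)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# Expected layout: <s> context tokens </s> </s> question tokens </s>
```
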
soumyasanyal commented 3 years ago

Hi, thanks for sharing the training config. I had to make some minor changes to make it work with the LeapOfThought repo (updated config). It turns out the learning rate scheduler was incorrectly set up in my code, which led to this issue all along. PyTorch Lightning has a particular way of registering schedulers, where the update interval must be set to step (it defaults to epoch), and I had missed that part of the code.
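
For anyone hitting the same problem, the fix boils down to registering the scheduler with a per-step interval in `configure_optimizers` (a minimal sketch; `total_training_steps` is a placeholder attribute I compute myself, not part of Lightning):

```python
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Inside the LightningModule:
def configure_optimizers(self):
    optimizer = AdamW(self.parameters(), lr=1e-5, eps=1e-8, weight_decay=0.1)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.06 * self.total_training_steps),  # placeholder attribute
        num_training_steps=self.total_training_steps,
    )
    return {
        "optimizer": optimizer,
        "lr_scheduler": {
            "scheduler": scheduler,
            "interval": "step",  # defaults to "epoch", which was the bug here
        },
    }
```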

Closing this now. Thanks again!