Closed: cooelf closed this issue 2 years ago
Hi @cooelf, we used microsoft/deberta-base from HuggingFace, which, I guess, is the base configuration of V1. Note that both V2 and V3 use the deberta-v2 prototype (model type) on HuggingFace.
We used a learning rate of 3e-5 across all base-sized models, with no warm-up or anything else special. We also used early stopping: up to 20 epochs in total, with a patience of 3 epochs.
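The early-stopping logic described above (up to 20 epochs, patience of 3) can be sketched in plain Python. This is a minimal illustration of the stopping rule, not the authors' actual training script; `eval_fn` is a hypothetical callback that trains one epoch and returns a validation score.

```python
def train_with_early_stopping(eval_fn, max_epochs=20, patience=3):
    """Run up to max_epochs; stop once `patience` epochs pass with no
    improvement over the best validation score seen so far.

    eval_fn(epoch) -> validation score for that epoch (higher is better).
    Returns (best_epoch, best_score).
    """
    best_score, best_epoch = float("-inf"), -1
    for epoch in range(max_epochs):
        score = eval_fn(epoch)
        if score > best_score:
            # New best checkpoint: reset the patience counter implicitly.
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            # `patience` consecutive epochs without improvement: stop.
            break
    return best_epoch, best_score
```

With HuggingFace's `Trainer`, the equivalent behavior is available via `EarlyStoppingCallback(early_stopping_patience=3)`.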
We were only able to benchmark the large version of RoBERTa; you may find the results in the Appendix of our paper. In this case, we used a learning rate of 1e-5, a warm-up ratio of 0.06, and a weight decay of 0.1, since we found that larger models are very unstable and "degenerate" with larger learning rates and no warm-up.
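To make the warm-up ratio concrete: with a ratio of 0.06, the learning rate ramps up linearly from 0 to the peak (1e-5) over the first 6% of training steps. The sketch below assumes the common linear-warm-up-then-linear-decay schedule (as in `transformers.get_linear_schedule_with_warmup`); the decay shape after warm-up is an assumption, not something stated in the thread.

```python
def lr_at_step(step, total_steps, peak_lr=1e-5, warmup_ratio=0.06):
    """Learning rate at a given optimizer step.

    Linear warm-up from 0 to peak_lr over the first warmup_ratio of
    training, then linear decay back to 0 (assumed decay shape).
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```

Without warm-up (ratio 0), the model sees the full peak learning rate from step 0, which is exactly the regime the reply reports as unstable for large models.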
Hi @iliaschalkidis, thanks a lot for the quick reply. Yeah, I also found that large models are unstable on this dataset. Maybe it is because I was using microsoft/deberta-v3-large. I will check the Appendix and try the recommended settings :)
Thanks!
Hi, my reproduced results for EUR-LEX are quite far from the reported ones. Could you provide the hyper-parameters of DeBERTa for EUR-LEX? And which version of DeBERTa was used: V2 or V3, Base or Large?
Looking forward to your reply. Thanks!