allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

RoBERTa on SuperGLUE's 'Words in Context' task #4998

Open dirkgr opened 3 years ago

dirkgr commented 3 years ago

WiC is one of the tasks of the SuperGLUE benchmark. The task is to re-trace the steps of Facebook's RoBERTa paper (https://arxiv.org/pdf/1907.11692.pdf) and build an AllenNLP config that reads the WiC data and fine-tunes a model on it. We expect scores in the range of their entry on the SuperGLUE leaderboard.

This can be formulated as a classification task, using the TransformerClassificationTT model, analogous to the IMDB model. You can start with the experiment config and dataset reading step from IMDB, and adapt them to your needs.
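
As a rough illustration of the classification framing, here is a minimal sketch in plain Python. It assumes the SuperGLUE JSONL release of WiC, where each line carries `word`, `sentence1`, `sentence2`, and (outside the test split) a boolean `label`. The function names `read_wic` and `to_pair`, the label names, and the choice to prefix each sentence with the target word are all hypothetical. They are not part of AllenNLP, and they are not necessarily how the RoBERTa authors formatted their input.

```python
import json
from typing import Dict, Iterable, Iterator, Optional, Tuple


def read_wic(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Parse WiC-style JSONL lines (assumed field names) into dicts."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield {
            "word": record["word"],
            "sentence1": record["sentence1"],
            "sentence2": record["sentence2"],
            # Unlabeled (test) records have no "label" key; keep None there.
            "label": record.get("label"),
        }


def to_pair(example: Dict[str, object]) -> Tuple[str, str, Optional[str]]:
    """Turn one example into a sentence pair plus a class name.

    Prefixing the target word is one possible input format (an assumption).
    """
    word = example["word"]
    text_a = f"{word}: {example['sentence1']}"
    text_b = f"{word}: {example['sentence2']}"
    label = example["label"]
    if label is None:
        label_name = None
    else:
        label_name = "same_sense" if label else "different_sense"
    return text_a, text_b, label_name


if __name__ == "__main__":
    sample = [
        '{"word": "bank", "sentence1": "She sat on the river bank.", '
        '"sentence2": "He works at a bank.", "label": false}',
    ]
    for example in read_wic(sample):
        print(to_pair(example))
```

The resulting sentence pair and label are what a dataset reading step would hand to a pair-capable tokenizer and a classification model.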

Soumyajain29 commented 3 years ago

Hi, I am a GSoC '21 aspirant and an NLP enthusiast. I find this project interesting and want to work on it. It will be my first contribution, so help with getting started would be highly appreciated.

dirkgr commented 3 years ago

Wonderful!

If you've never used AllenNLP, I recommend you start with the guide. With what you learn there, you can probably train one of our multiple-choice models. Those training configs are at https://github.com/allenai/allennlp-models/tree/main/training_config/mc. Multiple choice is arguably the closest thing we have to WiC (which is a binary classification task).

For your own code, I recommend you create a new project based on the template at https://github.com/allenai/allennlp-template-config-files. Start implementing the DatasetReader first, because that's easiest.

To save time, you can steal code from the existing multiple-choice models. The model for that is here: https://github.com/allenai/allennlp-models/blob/main/allennlp_models/mc/models/transformer_mc.py. The dataset readers are all in https://github.com/allenai/allennlp-models/tree/main/allennlp_models/mc/dataset_readers.

Let us know if you need anything else to get started!

Soumyajain29 commented 3 years ago

Hi, thanks for providing pointers. I am very comfortable with Python and PyTorch, and I am a bit familiar with AllenNLP too. I have started looking into them and working on a multiple-choice model. I will get back here in a few days after finishing all these.

You will be the mentor for this GSoC project, right? Could you give me a brief idea about how to enter GSoC '21 formally, what the expectations are, and whether there is a specific format for the project proposal?

If it's fine, can I mail you regarding these queries instead of posting them here?

dirkgr commented 3 years ago

You can definitely email us! If you have questions that you think might be interesting to many people, you can also create a topic on the discussions page: https://github.com/allenai/allennlp/discussions/categories/google-summer-of-code

I will be one of the mentors, yes. I don't know if I will be the mentor for this particular task. We won't know this until the GSoC selection process has happened and we know all the participants. But I assure you the others are at least as capable as me :-)

dirkgr commented 3 years ago

@Soumyajain29, since we are not going to be part of Google Summer of Code, are you still interested in this task? Otherwise I'd like to make it available to someone else.

dirkgr commented 3 years ago

I updated the description of this task to recommend the new Tango framework.