PetrochukM / PyTorch-NLP

Basic Utilities for PyTorch Natural Language Processing (NLP)
https://pytorchnlp.readthedocs.io
BSD 3-Clause "New" or "Revised" License
2.21k stars, 257 forks

Add GLUE datasets #26

Open PetrochukM opened 6 years ago

PetrochukM commented 6 years ago

GLUE datasets are standard for evaluating NLU tasks.

In pursuit of this objective, we introduce the General Language Understanding Evaluation benchmark (GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks.

PattynR commented 6 years ago

Hi, I am a Belgian student in computer engineering, and I am taking an introductory course on open source. One of my goals this semester is to contribute to a project. My master's thesis will be related to NLP, which is why this project interests me. Is there a way I could help fix this issue (or maybe another issue related to this project)?

PetrochukM commented 6 years ago

Hi There!

Yeah, please fix this issue! GLUE datasets are a popular suite of datasets for evaluating NLP models. It'd be nice if there was support for those datasets. This issue should be an easy one to get started with.

Recently, I was in Belgium for EMNLP 2018, one of the best NLP conferences in the world.

PattynR commented 5 years ago

Hey, too bad I missed EMNLP! This is my first year working on NLP, and I had never heard of those conferences. I hope I'll be able to go next year. About the issue: could you please confirm that my job is to add a new file, named "glue.py", to the torchnlp/datasets folder? I guess this is what I have to do, but I would prefer to be completely sure!

PetrochukM commented 5 years ago

Yeah that'd work!

PattynR commented 5 years ago

Hi, I'm almost done. For the moment it works for all the GLUE datasets except QQP and SNLI. There is an issue with those files that I don't know how to handle. When I load the QQP and SNLI datasets, some lines in the files themselves don't have the right number of fields. Here is an example to illustrate what I mean.

The first line of each downloaded file lists the names of the features in the TSV file. In the 'train.tsv' file of SNLI, for example, there should be 11 features per line. However, many lines (38,656 in total) contain more than 10 tabs, and therefore more than 11 features.

For the moment I have decided not to add those lines to the Dataset object, but I know this is not what should be done. I've searched the internet for an explanation of those lines, but there is not much documentation about QQP and SNLI.
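
A minimal sketch of the check described above, for anyone who wants to reproduce it: read the TSV, take the expected field count from the header, and set aside every row whose count differs. The function name `split_by_field_count` and the sample column names are hypothetical and not part of PyTorch-NLP or GLUE. Passing `quoting=csv.QUOTE_NONE` is an assumption (it treats quote characters as plain text); it may or may not be the right setting for these particular files.

```python
import csv
import io
from collections import Counter


def split_by_field_count(lines, delimiter="\t"):
    """Split TSV rows into well-formed and malformed rows.

    The expected number of fields is taken from the header (first line).
    `lines` can be an open text file or any iterable of strings.
    """
    # QUOTE_NONE is an assumption: quote characters are kept as literal text.
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    header = next(reader)
    expected = len(header)

    good, bad = [], []
    # The header is line 1, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        if len(row) == expected:
            good.append(dict(zip(header, row)))
        else:
            bad.append((line_number, row))
    return header, good, bad


# Tiny in-memory sample with hypothetical column names; the last row
# contains one extra tab, so it has 5 fields instead of 4.
sample = io.StringIO(
    "index\tsentence1\tsentence2\tlabel\n"
    "0\tA man sleeps.\tA person rests.\tentailment\n"
    "1\tA dog runs.\tAn animal\tmoves.\tneutral\n"
)
header, good, bad = split_by_field_count(sample)

# Histogram of how many extra (or missing) fields the malformed rows have.
extra_counts = Counter(len(row) - len(header) for _, row in bad)
print(len(good), "well-formed rows;", len(bad), "malformed rows")
print("extra fields per malformed row:", dict(extra_counts))
```

For a real file, the same function could be called with `open(path, encoding="utf-8", newline="")`. Looking at the histogram and a few of the malformed rows may help decide whether they can be repaired or have to be skipped.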

So do you maybe know what I should do? Or should I add my file to the project and create a new issue? Someone who has already worked with those datasets should be able to fix it easily.

Thanks.

PetrochukM commented 4 years ago

Thanks for your attempt at contributing this function: https://github.com/PetrochukM/PyTorch-NLP/pull/60 :)

karish-grover commented 3 years ago

Hey! I want to give this a try. Is there any way that I can do it still? It seems like it's too late to contribute to this project.