Possible datasets for probing task:
GLUE (MNLI section) - Natural Language Inference task.
Labels: contradiction, neutral, entailment
Example: Premise: How do you know? All this is their information again. Hypothesis: This information belongs to them. Answer: entailment
Dataset: https://huggingface.co/datasets/glue/viewer/mnli/train
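A minimal loading sketch, assuming the Hugging Face datasets library; the split and field names (premise, hypothesis, label) follow the GLUE/MNLI card and should be verified against the current hub version:

from datasets import load_dataset

# Load the MNLI subset of GLUE (train / validation_matched / validation_mismatched splits).
mnli = load_dataset("glue", "mnli")
example = mnli["train"][0]

# Each example is a premise/hypothesis pair with an integer class label.
label_names = mnli["train"].features["label"].names  # ["entailment", "neutral", "contradiction"]
print(example["premise"], example["hypothesis"], label_names[example["label"]])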
SNLI (Stanford Natural Language Inference) - same design as GLUE MNLI.
Labels: contradiction, neutral, entailment
Example: Premise: Two blond women are hugging one another. Hypothesis: The women are sleeping. Answer: contradiction
Dataset: https://huggingface.co/datasets/snli
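A similar sketch for SNLI, again assuming the datasets library; note that some SNLI examples carry label -1 (no annotator consensus) and are usually filtered out before building probing data:

from datasets import load_dataset

snli = load_dataset("snli")

# Drop examples without a gold label (label == -1) before probing.
train = snli["train"].filter(lambda ex: ex["label"] != -1)
label_names = train.features["label"].names  # ["entailment", "neutral", "contradiction"]
print(len(train), label_names)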
HANS (Heuristic Analysis for NLI Systems) - NLI evaluation dataset that tests specific hypotheses about invalid heuristics that NLI models are likely to learn. This is an interesting dataset for probing because we can compare against the implementation from the paper that introduced it (https://arxiv.org/pdf/1902.01007.pdf). The authors fine-tuned BERT for this task, so we can check whether probing internal representations handles these invalid heuristics better.
Labels: entailment, non-entailment
Example: Premise: The doctors supported the scientist. Hypothesis: The scientist supported the doctors. Answer: non-entailment
Drawbacks: Since the task was designed to demonstrate that models tend to follow shallow heuristics and shortcuts, probing may also struggle and give poor results.
Dataset: https://github.com/tommccoy1/hans
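A rough sketch of the probing setup this comparison assumes: encode premise/hypothesis pairs with a frozen pretrained model, take a hidden-layer representation, and fit a linear classifier. The hub id "hans", the bert-base-uncased encoder, and the layer choice are all assumptions; in the paper's actual setup the classifier is trained on MNLI and only evaluated on HANS, so this snippet only illustrates the mechanics.

import torch
from datasets import load_dataset
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

hans = load_dataset("hans", split="validation")  # assumed hub mirror of the GitHub data
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True).eval()

def encode(premise, hypothesis, layer=-1):
    # Mean-pool token states from one hidden layer as the probe input.
    inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer]
    return hidden.mean(dim=1).squeeze(0).numpy()

subset = hans.shuffle(seed=0).select(range(200))  # small subset just for illustration
X = [encode(ex["premise"], ex["hypothesis"]) for ex in subset]
y = [ex["label"] for ex in subset]  # 0 = entailment, 1 = non-entailment
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy on the same subset:", probe.score(X, y))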
IMDb Movie Reviews - binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. In my opinion, probing can capture positive and negative concepts well from reviews because of words that are important for the classes (e.g. "bad, awful, boring, sad" for negative, "good, wonderful, interesting, incredible" for positive).
Labels: positive, negative
Example: Ned Kelly is such an important story to Australians but this movie is awful. It's an Australian story yet it seems like it was set in America. Also Ned was an Australian yet he has an Irish accent...it is the worst film I have seen in a long time. Answer: negative
Drawbacks: Some reviews are quite long, about 500 words. If our model can handle long input representations, then this dataset can be a good fit for probing.
Dataset: https://huggingface.co/datasets/imdb
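A sketch of how the long-review drawback could be handled, assuming the datasets and transformers libraries and bert-base-uncased as the encoder: truncate each review to the encoder's maximum length before extracting representations.

from datasets import load_dataset
from transformers import AutoTokenizer

imdb = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed encoder

# Long reviews (hundreds of words) are cut down to the encoder's limit;
# anything beyond max_length is simply dropped from the representation.
encoded = tokenizer(imdb["train"][0]["text"], truncation=True, max_length=512)
print(len(encoded["input_ids"]), imdb["train"][0]["label"])  # 0 = negative, 1 = positive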
Hate Speech Offensive - dataset for hate speech and offensive language detection on tweets. I think that probing can show good results on this data since offensive speech consists of "bad words", which are rare in common speech and in most datasets. Therefore, these "bad" concepts should be captured well in internal representations.
Labels: offensive language, neither, hate speech
Example: I spend my money how i want bitch its my business Answer: offensive language
Drawbacks: Since the dataset is about hate speech, every example contains swear words, and it is always awkward to include such words in a paper's experiments or examples.
Dataset: https://huggingface.co/datasets/hate_speech_offensive
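A loading sketch, assuming the hub id hate_speech_offensive with tweet/class fields and the id-to-name mapping 0 = hate speech, 1 = offensive language, 2 = neither; these names should be double-checked against the dataset card.

from datasets import load_dataset

hso = load_dataset("hate_speech_offensive", split="train")  # the dataset ships a single train split
class_names = ["hate speech", "offensive language", "neither"]  # assumed id-to-name mapping

ex = hso[0]
print(ex["tweet"], "->", class_names[ex["class"]])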
GoEmotions - corpus of carefully curated comments extracted from Reddit, with human annotations for 27 emotion categories or neutral. It is interesting to check whether it is possible to probe emotion concepts.
Labels: admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise, neutral
Example: Man I love reddit. Answer: love
Drawbacks: Many labels (27 emotions plus neutral, 28 classes in total) can be hard to train on using internal representations.
Dataset: https://huggingface.co/datasets/go_emotions
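A sketch for GoEmotions, assuming the default simplified config on the hub; the labels field is a list because a comment can carry several emotions, so a single-label probe may want to keep only examples with exactly one label.

from datasets import load_dataset

go = load_dataset("go_emotions")  # default "simplified" config with 28 classes
label_names = go["train"].features["labels"].feature.names

# Keep only comments annotated with exactly one emotion for a standard softmax probe.
single = go["train"].filter(lambda ex: len(ex["labels"]) == 1)
ex = single[0]
print(ex["text"], "->", label_names[ex["labels"][0]])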
Tweet Sentiment Extraction - dataset of tweets with a sentiment class for every sentence. Probing can capture positive and negative concepts well because of important words.
Labels: positive, neutral, negative
Example: I really really like the song Love Story by Taylor Swift Answer: positive
Dataset: https://huggingface.co/datasets/mteb/tweet_sentiment_extraction
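A small sketch, assuming the mteb/tweet_sentiment_extraction hub id and a label_text field alongside the integer label (worth verifying on the dataset card):

from collections import Counter
from datasets import load_dataset

tweets = load_dataset("mteb/tweet_sentiment_extraction")

# Quick look at the class balance before deciding on a probing split.
print(Counter(tweets["train"]["label_text"]))
print(tweets["train"][0]["text"], "->", tweets["train"][0]["label_text"])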
BoolQ - question answering dataset for yes/no questions. This dataset is close to the true/false SAPLMA datasets but has a slightly different design: each example consists of a question and a passage, followed by a true/false label.
Labels: true, false
Example: Question: did abraham lincoln write the letter in saving private ryan? Passage: In the 1998 war film Saving Private Ryan, General George Marshall (played by Harve Presnell) reads the Bixby letter to his officers before giving the order to find and send home Private James Francis Ryan after Ryan's three brothers died in battle. Answer: true
Drawbacks: Some passages are long. Also, since we already have many true/false datasets from SAPLMA, maybe it is better to introduce experiments on different tasks for more variety.
Dataset: https://huggingface.co/datasets/boolq
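A sketch for turning BoolQ into probing inputs, assuming the boolq hub id with question/passage/answer fields and bert-base-uncased as the encoder: pair the question with the passage the way premise/hypothesis are paired in NLI, and map the boolean answer to a true/false label.

from datasets import load_dataset
from transformers import AutoTokenizer

boolq = load_dataset("boolq")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed encoder

ex = boolq["train"][0]
# Question and passage go in as a sentence pair, like premise/hypothesis in NLI.
inputs = tokenizer(ex["question"], ex["passage"], truncation=True, max_length=512)
label = "true" if ex["answer"] else "false"
print(ex["question"], "->", label)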