JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

T5 Task documentation #2101

Closed C-K-Loan closed 3 years ago

C-K-Loan commented 3 years ago

I put together this documentation on every task T5 was trained on and that is accessible. I managed to run everything properly except task 13. Once these docs are finalized, they need to be added to the T5 models on the Models Hub.

Overview of every task available with T5

The T5 model is trained on various datasets for 18 different tasks which fall into 8 categories.

  1. Text summarization
  2. Question answering
  3. Translation
  4. Sentiment analysis
  5. Natural Language inference
  6. Coreference resolution
  7. Sentence Completion
  8. Word sense disambiguation

Every T5 Task with explanation:

| Task Name | Explanation |
|---|---|
| 1. CoLA | Classify whether a sentence is grammatically correct. |
| 2. RTE | Classify whether a statement can be deduced from a sentence. |
| 3. MNLI | Classify for a hypothesis and premise whether they entail or contradict each other, or neither (3 classes). |
| 4. MRPC | Classify whether a pair of sentences is a re-phrasing of each other (semantically equivalent). |
| 5. QNLI | Classify whether the answer to a question can be deduced from a candidate sentence. |
| 6. QQP | Classify whether a pair of questions is a re-phrasing of each other (semantically equivalent). |
| 7. SST2 | Classify the sentiment of a sentence as positive or negative. |
| 8. STSB | Score the semantic similarity of two sentences on a scale from 0 to 5 (21 classes). |
| 9. CB | Classify for a premise and a hypothesis whether the premise entails, contradicts, or is neutral to the hypothesis. |
| 10. COPA | Classify for a premise, a question, and 2 choices which choice is the correct one (binary). |
| 11. MultiRc | Classify for a question, a paragraph of text, and an answer candidate whether the answer is correct (binary). |
| 12. WiC | Classify for a pair of sentences and an ambiguous word whether the word has the same meaning in both sentences. |
| 13. WSC/DPR | Predict what an ambiguous pronoun in a sentence is referring to. |
| 14. Summarization | Summarize text into a shorter representation. |
| 15. SQuAD | Answer a question for a given context. |
| 16. WMT1 | Translate English to German. |
| 17. WMT2 | Translate English to French. |
| 18. WMT3 | Translate English to Romanian. |

Information about pre-processing for T5 tasks

Tasks that require no pre-processing

The following tasks work fine without any additional pre-processing; only setting the task parameter on the T5 model is required:

- CoLA
- SST2
- Summarization
- WMT1, WMT2, WMT3 (translation)

Tasks that require pre-processing with 1 tag

The following tasks require exactly 1 additional tag added by manual pre-processing. Set the task parameter and then join the sentences on the tag for these tasks:

- RTE
- MNLI
- MRPC
- QNLI
- QQP
- STSB
- CB
- SQuAD

Tasks that require pre-processing with multiple tags

The following tasks require more than 1 additional tag added by manual pre-processing. Set the task parameter, then prefix the sentences with their corresponding tags and join them for these tasks:

- COPA
- MultiRc
- WiC

WSC/DPR is a special case that requires * surrounding

The task WSC/DPR requires highlighting a pronoun with * and configuring a task parameter.







The following sections describe each task in detail, with an example and also a pre-processed example.

NOTE: Line breaks are added to the pre-processed examples in the following sections for readability. The T5 model also works with line breaks, but they can hinder performance, so it is not recommended to add them intentionally.
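
For reference, here is a minimal sketch (not from the original write-up) of how a task can be configured in a Spark NLP pipeline; it assumes the Python API and a pretrained model named t5_base:

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import T5Transformer
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Turn the raw text column into Spark NLP documents
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

# Load a pretrained T5 model and pick a task via its prefix,
# here the CoLA grammatical acceptability task
t5 = T5Transformer.pretrained("t5_base") \
    .setTask("cola sentence:") \
    .setInputCols(["documents"]) \
    .setOutputCol("t5")

pipeline = Pipeline(stages=[document_assembler, t5])

data = spark.createDataFrame(
    [["Anna and Mike is going skiing and they is liked is"]]
).toDF("text")

pipeline.fit(data).transform(data).select("t5.result").show(truncate=False)
```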

Task 1 CoLA - Binary Grammatical Sentence acceptability classification

Judges if a sentence is grammatically acceptable.
This is a sub-task of GLUE.

Example

| Sentence | Prediction |
|---|---|
| Anna and Mike is going skiing and they is liked is | unacceptable |
| Anna and Mike like to dance | acceptable |

How to configure T5 task for CoLA

.setTask("cola sentence:")

Example pre-processed input for T5 CoLA sentence acceptability judgement:

cola 
sentence: Anna and Mike is going skiing and they is liked is

Task 2 RTE - Natural language inference Deduction Classification

The RTE task is defined as recognizing, given two text fragments, whether the meaning of one text can be inferred (entailed) from the other or not.
Classification of sentence pairs as entailed and not_entailed
This is a sub-task of GLUE and SuperGLUE.

Example

| Sentence 1 | Sentence 2 | Prediction |
|---|---|---|
| Kessler ’s team conducted 60,643 interviews with adults in 14 countries. | Kessler ’s team interviewed more than 60,000 adults in 14 countries | entailed |
| Peter loves New York, it is his favorite city | Peter loves new York. | entailed |
| Recent report say Johnny makes he alot of money, he earned 10 million USD each year for the last 5 years. | Johnny is a millionare | entailment |
| Recent report say Johnny makes he alot of money, he earned 10 million USD each year for the last 5 years. | Johnny is a poor man | not_entailment |
| It was raining in England for the last 4 weeks | England was very dry yesterday | not_entailment |

How to configure T5 task for RTE

.setTask("rte sentence1:") and prefix the second sentence with sentence2:

Example pre-processed input for T5 RTE - 2 Class Natural language inference

rte 
sentence1: Recent report say Peter makes he alot of money, he earned 10 million USD each year for the last 5 years. 
sentence2: Peter is a millionare.
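
As an illustration (a sketch, not from the original write-up), this single-tag pre-processing is plain string concatenation done before the text reaches the T5 annotator; the variable names below are hypothetical:

```python
# The task prefix "rte sentence1:" is added by .setTask(), so only the
# second sentence has to be prefixed with its tag and joined manually.
sentence1 = "Recent report say Peter makes he alot of money, he earned 10 million USD each year for the last 5 years."
sentence2 = "Peter is a millionare."

t5_input = f"{sentence1} sentence2: {sentence2}"
print(t5_input)
```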


Task 3 MNLI - 3-Class Natural Language Inference (contradiction classification)

Classification of sentence pairs with the labels entailment, contradiction, and neutral.
This is a sub-task of GLUE.

This classifier predicts, for a pair of sentences (hypothesis and premise), one of the three labels:

| Hypothesis | Premise | Prediction |
|---|---|---|
| Recent report say Johnny makes he alot of money, he earned 10 million USD each year for the last 5 years. | Johnny is a poor man. | contradiction |
| It rained in England the last 4 weeks. | It was snowing in New York last week | neutral |

How to configure T5 task for MNLI

.setTask("mnli hypothesis:") and prefix the second sentence with premise:

Example pre-processed input for T5 MNLI - 3 Class Natural Language Inference

mnli 
hypothesis: At 8:34, the Boston Center controller received a third, transmission from American 11.    
premise: The Boston Center controller got a third transmission from American 11.

Task 4 MRPC - Binary Paraphrasing/ sentence similarity classification

Detect whether one sentence is a re-phrasing of, or semantically similar to, another sentence.
This is a sub-task of GLUE.

| Sentence 1 | Sentence 2 | Prediction |
|---|---|---|
| We acted because we saw the existing evidence in a new light , through the prism of our experience on 11 September , " Rumsfeld said . | Rather , the US acted because the administration saw " existing evidence in a new light , through the prism of our experience on September 11 " . | equivalent |
| I like to eat peanutbutter for breakfast | I like to play football | not_equivalent |

How to configure T5 task for MRPC

.setTask("mrpc sentence1:") and prefix the second sentence with sentence2:

Example pre-processed input for T5 MRPC - Binary Paraphrasing/ sentence similarity

mrpc 
sentence1: We acted because we saw the existing evidence in a new light , through the prism of our experience on 11 September , " Rumsfeld said . 
sentence2: Rather , the US acted because the administration saw " existing evidence in a new light , through the prism of our experience on September 11",

ISSUE: Only neutral and contradiction are returned as predictions for the tested samples; no entailment predictions are produced.

Task 5 QNLI - Natural Language Inference question answered classification

Classify whether a question is answered by a sentence (entailed).
This is a sub-task of GLUE.

| Question | Answer | Prediction |
|---|---|---|
| Where did Jebe die? | Genghis Khan recalled Subutai back to Mongolia soon afterward, and Jebe died on the road back to Samarkand | entailment |
| What does Steve like to eat? | Steve watches TV all day | not_entailment |

How to configure T5 task for QNLI - Natural Language Inference question answered classification

.setTask("qnli question:") and prefix the candidate sentence with sentence:

Example pre-processed input for T5 QNLI - Natural Language Inference question answered classification

qnli
question: Where did Jebe die?     
sentence: Genghis Khan recalled Subutai back to Mongolia soon afterwards, and Jebe died on the road back to Samarkand

Task 6 QQP - Binary Question Similarity/Paraphrasing

Based on a Quora dataset, determine whether a pair of questions is semantically equivalent.
This is a sub-task of GLUE.

| Question 1 | Question 2 | Prediction |
|---|---|---|
| What attributes would have made you highly desirable in ancient Rome? | How I GET OPPERTINUTY TO JOIN IT COMPANY AS A FRESHER? | not_duplicate |
| What was it like in Ancient rome? | What was Ancient rome like? | duplicate |

How to configure T5 task for QQP

.setTask("qqp question1:") and prefix the second question with question2:

Example pre-processed input for T5 QQP - Binary Question Similarity/Paraphrasing

qqp 
question1: What attributes would have made you highly desirable in ancient Rome?        
question2: How I GET OPPERTINUTY TO JOIN IT COMPANY AS A FRESHER?

Task 7 SST2 - Binary Sentiment Analysis

Binary sentiment classification.
This is a sub-task of GLUE.

| Sentence | Prediction |
|---|---|
| it confirms fincher ’s status as a film maker who artfully bends technical know-how to the service of psychological insight | positive |
| I really hated that movie | negative |

How to configure T5 task for SST2

.setTask('sst2 sentence: ')

Example pre-processed input for T5 SST2 - Binary Sentiment Analysis

sst2
sentence: I hated that movie

Task 8 STSB - Regressive semantic sentence similarity

Measures how similar two sentences are on a scale from 0 to 5, with 21 classes representing the regression label.
This is a sub-task of GLUE.

| Question 1 | Question 2 | Prediction |
|---|---|---|
| What attributes would have made you highly desirable in ancient Rome? | How I GET OPPERTINUTY TO JOIN IT COMPANY AS A FRESHER? | 0 |
| What was it like in Ancient rome? | What was Ancient rome like? | 5.0 |
| What was live like as a King in Ancient Rome?? | What is it like to live in Rome? | 3.2 |

How to configure T5 task for STSB

.setTask("stsb sentence1:") and prefix the second sentence with sentence2:

Example pre-processed input for T5 STSB - Regressive semantic sentence similarity

stsb
sentence1: What attributes would have made you highly desirable in ancient Rome?        
sentence2: How I GET OPPERTINUTY TO JOIN IT COMPANY AS A FRESHER?

Task 9 CB - Natural language inference contradiction classification

Classify whether a premise contradicts a hypothesis.
Predicts entailment, neutral, or contradiction.
This is a sub-task of SuperGLUE.

| Hypothesis | Premise | Prediction |
|---|---|---|
| Valence was helping | Valence the void-brain, Valence the virtuous valet. Why couldn’t the figger choose his own portion of titanic anatomy to shaft? Did he think he was helping? | Contradiction |

How to configure T5 task for CB

.setTask("cb hypothesis:") and prefix the premise with premise:

Example pre-processed input for T5 CB - Natural language inference contradiction classification

cb 
hypothesis: What attributes would have made you highly desirable in ancient Rome?        
premise: How I GET OPPERTINUTY TO JOIN IT COMPANY AS A FRESHER?

Task 10 COPA - Sentence Completion/ Binary choice selection

The Choice of Plausible Alternatives (COPA) task by Roemmele et al. (2011) evaluates causal reasoning between events, which requires commonsense knowledge about what usually takes place in the world. Each example provides a premise and either asks for the correct cause or effect from two choices, thus testing either backward or forward causal reasoning. COPA data, which consists of 1,000 examples total, can be downloaded at https://people.ict.usc.e

This is a sub-task of SuperGLUE.

This classifier selects, from 2 choices, which one is correct based on a premise.

forward causal reasoning

Premise: The man lost his balance on the ladder.
question: What happened as a result?
Alternative 1: He fell off the ladder.
Alternative 2: He climbed up the ladder.

backwards causal reasoning

Premise: The man fell unconscious. What was the cause of this?
Alternative 1: The assailant struck the man in the head.
Alternative 2: The assailant took the man’s wallet.

| Question | Premise | Choice 1 | Choice 2 | Prediction |
|---|---|---|---|---|
| effect | Political violence broke out in the nation. | many citizens relocated to the capitol. | Many citizens took refuge in other territories | Choice 1 |
| cause | The men fell unconscious | The assailant struck the man in the head | The assailant took the man's wallet. | choice1 |

How to configure T5 task for COPA

.setTask("copa choice1:"), prefix the second choice with choice2:, the premise with premise:, and the question with question:

Example pre-processed input for T5 COPA - Sentence Completion/ Binary choice selection

copa 
choice1:   He fell off the ladder    
choice2:   He climbed up the ladder       
premise:   The man lost his balance on the ladder 
question:  effect
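
One possible way to assemble this multi-tag input in code (a sketch; the variable names are illustrative):

```python
# The task prefix "copa choice1:" comes from .setTask(); the remaining
# parts are prefixed with their tags and concatenated manually.
choice1 = "He fell off the ladder"
choice2 = "He climbed up the ladder"
premise = "The man lost his balance on the ladder"
question = "effect"

t5_input = f"{choice1} choice2: {choice2} premise: {premise} question: {question}"
print(t5_input)
```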

Task 11 MultiRc - Question Answering

Evaluates an answer for a question as true or false based on an input paragraph. The T5 model predicts, for a question and a paragraph of sentences, whether an answer is true or not, based on the semantic contents of the paragraph.
This is a sub-task of SuperGLUE.

Exceeds human performance by a large margin.

| Question | Answer | Prediction |
|---|---|---|
| Why was Joey surprised the morning he woke up for breakfast? | There was only pie to eat, rather than traditional breakfast foods | True |
| Why was Joey surprised the morning he woke up for breakfast? | There was a T-Rex in his garden | False |

Paragraph (identical for both examples): Once upon a time, there was a squirrel named Joey. Joey loved to go outside and play with his cousin Jimmy. Joey and Jimmy played silly games together, and were always laughing. One day, Joey and Jimmy went swimming together 50 at their Aunt Julie’s pond. Joey woke up early in the morning to eat some food before they left. He couldn’t find anything to eat except for pie! Usually, Joey would eat cereal, fruit (a pear), or oatmeal for breakfast. After he ate, he and Jimmy went to the pond. On their way there they saw their friend Jack Rabbit. They dove into the water and swam for several hours. The sun was out, but the breeze was cold. Joey and Jimmy got out of the water and started walking home. Their fur was wet, and the breeze chilled them. When they got home, they dried off, and Jimmy put on his favorite purple shirt. Joey put on a blue shirt with red and green dots. The two squirrels ate some food that Joey’s mom, Jasmine, made and went off to bed.

How to configure T5 task for MultiRC

.setTask("multirc questions:"), followed by the answer: prefix for the answer to evaluate, followed by paragraph: and then a series of sentences, where each sentence is prefixed with Sent n:

Example pre-processed input for T5 MultiRc task:

multirc questions:  Why was Joey surprised the morning he woke up for breakfast?      
answer:             There was a T-REX in his garden.      
paragraph:      
Sent 1:             Once upon a time, there was a squirrel named Joey.      
Sent 2:             Joey loved to go outside and play with his cousin Jimmy.      
Sent 3:             Joey and Jimmy played silly games together, and were always laughing.      
Sent 4:             One day, Joey and Jimmy went swimming together 50 at their Aunt Julie’s pond.      
Sent 5:             Joey woke up early in the morning to eat some food before they left.      
Sent 6:             He couldn’t find anything to eat except for pie!      
Sent 7:             Usually, Joey would eat cereal, fruit (a pear), or oatmeal for breakfast.      
Sent 8:             After he ate, he and Jimmy went to the pond.      
Sent 9:             On their way there they saw their friend Jack Rabbit.      
Sent 10:            They dove into the water and swam for several hours.      
Sent 11:            The sun was out, but the breeze was cold.      
Sent 12:            Joey and Jimmy got out of the water and started walking home.      
Sent 13:            Their fur was wet, and the breeze chilled them.      
Sent 14:            When they got home, they dried off, and Jimmy put on his favorite purple shirt.      
Sent 15:            Joey put on a blue shirt with red and green dots.      
Sent 16:            The two squirrels ate some food that Joey’s mom, Jasmine, made and went off to bed.      
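
A sketch of how this input could be assembled programmatically (illustrative names; the paragraph is assumed to already be split into sentences):

```python
# Prefix each sentence of the paragraph with "Sent n:"; the task prefix
# "multirc questions:" is added via .setTask().
question = "Why was Joey surprised the morning he woke up for breakfast?"
answer = "There was a T-REX in his garden."
sentences = [
    "Once upon a time, there was a squirrel named Joey.",
    "Joey loved to go outside and play with his cousin Jimmy.",
    "He couldn't find anything to eat except for pie!",
]

numbered = " ".join(f"Sent {i}: {s}" for i, s in enumerate(sentences, start=1))
t5_input = f"{question} answer: {answer} paragraph: {numbered}"
print(t5_input)
```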

Task 12 WiC - Word sense disambiguation

Decide, for two sentences that share an ambiguous word, whether the target word has the same semantic meaning in both sentences.
This is a sub-task of SuperGLUE.

| Prediction | Ambiguous word | Sentence 1 | Sentence 2 |
|---|---|---|---|
| False | kill | He totally killed that rock show! | The airplane crash killed his family |
| True | window | The expanded window will give us time to catch the thieves. | You have a two-hour window for turning in your homework. |
| False | window | He jumped out of the window. | You have a two-hour window for turning in your homework. |

How to configure T5 task for WiC

.setTask("wic pos:"), followed by the sentence1: prefix for the first sentence, the sentence2: prefix for the second sentence, and the word: prefix for the target word.

Example pre-processed input for T5 WiC task:

wic pos:
sentence1:    The expanded window will give us time to catch the thieves.
sentence2:    You have a two-hour window of turning in your homework.
word :        window

Task 13 WSC and DPR - Coreference resolution/ Pronoun ambiguity resolver

Predict, for an ambiguous pronoun in a sentence, the noun it refers to.
This is a sub-task of GLUE and SuperGLUE.

| Prediction | Text |
|---|---|
| stable | The stable was very roomy, with four good stalls; a large swinging window opened into the yard , which made it pleasant and airy. |

How to configure T5 task for WSC/DPR

.setTask("wsc:") and surround the pronoun with asterisk symbols.

Example pre-processed input for T5 WSC/DPR task:

The ambiguous pronoun should be surrounded with * symbols.

Note: Read Appendix A for more info.

wsc: 
The stable was very roomy, with four good stalls; a large swinging window opened into the yard , which made *it* pleasant and airy.
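
A small sketch of the highlighting step, assuming the ambiguous pronoun is known (the helper below is illustrative, not part of Spark NLP):

```python
def highlight_pronoun(text: str, pronoun: str) -> str:
    """Surround the first occurrence of the ambiguous pronoun with * symbols."""
    return text.replace(f" {pronoun} ", f" *{pronoun}* ", 1)

sentence = ("The stable was very roomy, with four good stalls; a large swinging "
            "window opened into the yard , which made it pleasant and airy.")
print(highlight_pronoun(sentence, "it"))
# -> ... which made *it* pleasant and airy.
```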

Task 14 Text summarization

Summarizes a paragraph into a shorter version with the same semantic meaning.

| Predicted summary | Text |
|---|---|
| manchester united face newcastle in the premier league on wednesday . louis van gaal's side currently sit two points clear of liverpool in fourth . the belgian duo took to the dance floor on monday night with some friends . | the belgian duo took to the dance floor on monday night with some friends . manchester united face newcastle in the premier league on wednesday . red devils will be looking for just their second league away win in seven . louis van gaal’s side currently sit two points clear of liverpool in fourth . |

How to configure T5 task for summarization

.setTask("summarize:")

Example pre-processed input for T5 summarization task:

This task requires no pre-processing; setting the task to summarize: is sufficient.

the belgian duo took to the dance floor on monday night with some friends . manchester united face newcastle in the premier league on wednesday . red devils will be looking for just their second league away win in seven . louis van gaal’s side currently sit two points clear of liverpool in fourth .
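
A minimal configuration sketch for this task (assuming a pretrained model named t5_base; setMaxOutputLength bounds the length of the generated summary):

```python
from sparknlp.annotator import T5Transformer

# Summarization needs no manual pre-processing; only the task prefix is set.
t5_summarizer = T5Transformer.pretrained("t5_base") \
    .setTask("summarize:") \
    .setMaxOutputLength(200) \
    .setInputCols(["documents"]) \
    .setOutputCol("summary")
```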

Task 15 SQuAD - Context based question answering

Predict an answer to a question based on input context.

| Predicted Answer | Question | Context |
|---|---|---|
| carbon monoxide | What does increased oxygen concentrations in the patient’s lungs displace? | Hyperbaric (high-pressure) medicine uses special oxygen chambers to increase the partial pressure of O 2 around the patient and, when needed, the medical staff. Carbon monoxide poisoning, gas gangrene, and decompression sickness (the ’bends’) are sometimes treated using these devices. Increased O 2 concentration in the lungs helps to displace carbon monoxide from the heme group of hemoglobin. Oxygen gas is poisonous to the anaerobic bacteria that cause gas gangrene, so increasing its partial pressure helps kill them. Decompression sickness occurs in divers who decompress too quickly after a dive, resulting in bubbles of inert gas, mostly nitrogen and helium, forming in their blood. Increasing the pressure of O 2 as soon as possible is part of the treatment. |
| pie | What did Joey eat for breakfast? | Once upon a time, there was a squirrel named Joey. Joey loved to go outside and play with his cousin Jimmy. Joey and Jimmy played silly games together, and were always laughing. One day, Joey and Jimmy went swimming together 50 at their Aunt Julie’s pond. Joey woke up early in the morning to eat some food before they left. Usually, Joey would eat cereal, fruit (a pear), or oatmeal for breakfast. After he ate, he and Jimmy went to the pond. On their way there they saw their friend Jack Rabbit. They dove into the water and swam for several hours. The sun was out, but the breeze was cold. Joey and Jimmy got out of the water and started walking home. Their fur was wet, and the breeze chilled them. When they got home, they dried off, and Jimmy put on his favorite purple shirt. Joey put on a blue shirt with red and green dots. The two squirrels ate some food that Joey’s mom, Jasmine, made and went off to bed |

How to configure T5 task parameter for Squad Context based question answering

.setTask("question:") and prefix the context, which can consist of multiple sentences, with context:

Example pre-processed input for T5 Squad Context based question answering:

question: What does increased oxygen concentrations in the patient’s lungs displace? 
context: Hyperbaric (high-pressure) medicine uses special oxygen chambers to increase the partial pressure of O 2 around the patient and, when needed, the medical staff. Carbon monoxide poisoning, gas gangrene, and decompression sickness (the ’bends’) are sometimes treated using these devices. Increased O 2 concentration in the lungs helps to displace carbon monoxide from the heme group of hemoglobin. Oxygen gas is poisonous to the anaerobic bacteria that cause gas gangrene, so increasing its partial pressure helps kill them. Decompression sickness occurs in divers who decompress too quickly after a dive, resulting in bubbles of inert gas, mostly nitrogen and helium, forming in their blood. Increasing the pressure of O 2 as soon as possible is part of the treatment.
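
A sketch of how the question and context can be combined for this task (illustrative variable names):

```python
# The task prefix "question:" is set via .setTask(); only the context:
# tag has to be added manually.
question = "What does increased oxygen concentrations in the patient's lungs displace?"
context = ("Hyperbaric (high-pressure) medicine uses special oxygen chambers to "
           "increase the partial pressure of O 2 around the patient and, when "
           "needed, the medical staff.")

t5_input = f"{question} context: {context}"
print(t5_input)
```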

Task 16 WMT1 Translate English to German

For translation tasks, use the Marian model.

How to configure T5 task parameter for WMT Translate English to German

.setTask("translate English to German:")

Task 17 WMT2 Translate English to French

For translation tasks, use the Marian model.

How to configure T5 task parameter for WMT Translate English to French

.setTask("translate English to French:")

Task 18 WMT3 - Translate English to Romanian

For translation tasks, use the Marian model.

How to configure T5 task parameter for English to Romanian

.setTask("translate English to Romanian:")
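
A minimal configuration sketch for the WMT translation tasks (assuming a pretrained model named t5_base; only the task prefix differs between the three language pairs):

```python
from sparknlp.annotator import T5Transformer

# English -> Romanian; use "translate English to German:" or
# "translate English to French:" for the other WMT tasks.
t5_translator = T5Transformer.pretrained("t5_base") \
    .setTask("translate English to Romanian:") \
    .setInputCols(["documents"]) \
    .setOutputCol("translation")
```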

C-K-Loan commented 3 years ago

Discussion started here https://github.com/JohnSnowLabs/spark-nlp/discussions/2105