Question about model training

PrithivirajDamodaran / Parrot_Paraphraser

A practical and feature-rich paraphrasing framework to augment human intents in text form to build robust NLU models for conversational engines. Created by Prithiviraj Damodaran. Open to pull requests and other forms of collaboration.

Apache License 2.0


Hello, your work is wonderful. I'd like to create something like this in my native language (Persian).

Could you please let me know how you trained those T5s?

I have access to translated Quora question pairs, and I think the training process looks like the following:

Filter the dataset to keep only the pairs of similar sentences.

Train a text generation model from sentence 1 to sentence 2, and also from sentence 2 to sentence 1.

The model is a text2text generation model.

I mean training only, with no postprocessing included.

Is this correct?
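
For concreteness, the steps above can be sketched as a small data-preparation function. This is only a sketch under assumptions: the row layout (question1, question2, is_duplicate) follows the original Quora Question Pairs release and may differ in a translated copy, and `build_pairs` is a hypothetical helper name, not something from Parrot.

```python
from typing import Iterable, List, Tuple

def build_pairs(rows: Iterable[Tuple[str, str, int]]) -> List[Tuple[str, str]]:
    """Keep only rows labelled as paraphrases and emit both directions.

    Each row is assumed to be (question1, question2, is_duplicate).
    """
    pairs: List[Tuple[str, str]] = []
    for q1, q2, is_duplicate in rows:
        q1, q2 = q1.strip(), q2.strip()
        # Skip non-paraphrases, empty strings, and identical sentences.
        if is_duplicate != 1 or not q1 or not q2 or q1 == q2:
            continue
        pairs.append((q1, q2))  # sentence 1 -> sentence 2
        pairs.append((q2, q1))  # sentence 2 -> sentence 1
    return pairs

# Toy rows for illustration only; not taken from any real dataset.
rows = [
    ("How do I learn Python?", "What is the best way to learn Python?", 1),
    ("How do I learn Python?", "How do I cook rice?", 0),
]
pairs = build_pairs(rows)
```

Each (source, target) tuple would then be one training example for a text2text model.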

Preprocessing is definitely needed. Depending on the dataset, you may not need special characters like \n, but you do need apostrophes and the like, so it depends on your application. Do the cleanup.
Quora question pairs is not a great dataset for utterance paraphrasing; in my case I used only a small slice of it, because Quora has only questions, not commands. You have to go beyond question paraphrasing.
Once you have a cleaned and preprocessed dataset with enough variety, pretty much any seq2seq model can be used for fine-tuning. I fine-tuned T5, Pegasus, and BART, and ended up picking the T5 version.
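
As a rough illustration of the cleanup mentioned above, here is a minimal sketch that removes control characters such as \n while leaving apostrophes untouched. It is not the preprocessing actually used for Parrot; `clean_utterance` is a hypothetical name, and which characters to keep is an assumption that depends on the application.

```python
import re
import unicodedata

def clean_utterance(text: str) -> str:
    """Replace control characters with spaces and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    # Only category "Cc" (newline, tab, etc.) is replaced. Format characters
    # (category "Cf"), such as the zero-width non-joiner used in Persian
    # spelling, are deliberately kept, as is punctuation like apostrophes.
    text = "".join(" " if unicodedata.category(ch) == "Cc" else ch for ch in text)
    # Collapse runs of whitespace into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

cleaned = clean_utterance("Don't\n stop\tnow ")
```

Keeping the rule set this small makes it easy to adjust once the real dataset shows which characters matter.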

Question about model training #2