PrithivirajDamodaran / Parrot_Paraphraser

A practical and feature-rich paraphrasing framework to augment human intents in text form to build robust NLU models for conversational engines. Created by Prithiviraj Damodaran. Open to pull requests and other forms of collaboration.
Apache License 2.0

Is it possible to keep the capitalization? #26

Closed danielvoelk closed 2 years ago

danielvoelk commented 2 years ago

All my results are in lowercase. The API test on Huggingface has upper case. How do I enable that? :)

PrithivirajDamodaran commented 2 years ago

The HF test widget runs inference on just one model, the paraphrase model. Parrot uses 3 models. For example, the text is lowercased so that the adequacy and other scoring models can give precise scores. You can't compare the outputs directly. Read the documentation to understand how the framework works :)

The question is: why do you need the capitalisation during augmentation?

danielvoelk commented 2 years ago

Thanks for the quick reply. Ok, I understand.

What model tag do I have to use to get just the paraphrase model and keep the capitalisation on? I didn't find this in the Parrot documentation. Or is there another one I'm missing? :)

PrithivirajDamodaran commented 2 years ago

prithivida/parrot_paraphraser_on_T5

danielvoelk commented 2 years ago

I'm using that one, but I still get multiple answers, and without capitalization. What do I have to adjust to get only the paraphrase model, like on HF?

PrithivirajDamodaran commented 2 years ago

What problem are you trying to solve, and why do you need the caps? Did you see my question above?

danielvoelk commented 2 years ago

Ah no, I missed that.

I just googled "Python paraphraser" and found Parrot. I'm not solving any real problem; I just need a paraphraser to paraphrase some text in Python while keeping the capitalization.

PrithivirajDamodaran commented 2 years ago

prithivida/parrot_paraphraser_on_T5

Ok, use this model, but not the Parrot framework.

danielvoelk commented 2 years ago

How can I use it without the parrot framework?

PrithivirajDamodaran commented 2 years ago
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForSeq2SeqLM.from_pretrained("prithivida/parrot_paraphraser_on_T5")
tokenizer = AutoTokenizer.from_pretrained("prithivida/parrot_paraphraser_on_T5")
model.to(device)
model.eval()

sentence = "What are the famous places we should not miss in Russia?"

# tokenizer(...) returns a BatchEncoding dict; tokenizer.encode(...) would
# return a bare tensor, so indexing it with ["input_ids"] would fail.
encoding = tokenizer("paraphrase: " + sentence, return_tensors="pt")
input_ids = encoding["input_ids"].to(device)

paraphrases = model.generate(
    input_ids=input_ids,
    max_length=128,
    early_stopping=True,
    num_beams=5,
    num_return_sequences=5,
)

print("Original phrase:", sentence)
for paraphrase in paraphrases:
    print(tokenizer.decode(paraphrase, skip_special_tokens=True, clean_up_tokenization_spaces=True))