Vamsi995 / Paraphrase-Generator

A paraphrase generator built using the T5 model which produces paraphrased English sentences.
MIT License
310 stars, 66 forks

Tokenizer issue on Google Colab #5

Closed MastafaF closed 3 years ago

MastafaF commented 3 years ago

Python code:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("./T5_Paraphrase_Paws", use_fast=False)  
model = AutoModelForSeq2SeqLM.from_pretrained("./T5_Paraphrase_Paws")

sentence = "This is something which i cannot understand at all"
text =  "paraphrase: " + sentence + " </s>"
encoding = tokenizer.encode_plus(text,pad_to_max_length=True, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to("cuda"), encoding["attention_mask"].to("cuda")

outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_masks,
    max_length=256,
    do_sample=True,
    top_k=200,
    top_p=0.95,
    early_stopping=True,
    num_return_sequences=5
)

for output in outputs:
    line = tokenizer.decode(output, skip_special_tokens=True,clean_up_tokenization_spaces=True)
    print(line)

Error on Colab:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-21-c0b69f8e2f6a> in <module>()
      1 from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
      2 
----> 3 tokenizer = AutoTokenizer.from_pretrained("./T5_Paraphrase_Paws", use_fast=False)
      4 model = AutoModelForSeq2SeqLM.from_pretrained("./T5_Paraphrase_Paws")
      5 

4 frames
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_fast.py in __init__(self, *args, **kwargs)
     94         else:
     95             raise ValueError(
---> 96                 "Couldn't instantiate the backend tokenizer from one of: "
     97                 "(1) a `tokenizers` library serialization file, "
     98                 "(2) a slow tokenizer instance to convert or "

ValueError: Couldn't instantiate the backend tokenizer from one of: (1) a `tokenizers` library serialization file, (2) a slow tokenizer instance to convert or (3) an equivalent slow tokenizer class to instantiate and convert. You need to have sentencepiece installed to convert a slow tokenizer to a fast one.

Solution tried, without success: setting use_fast to False.

tokenizer = AutoTokenizer.from_pretrained("./T5_Paraphrase_Paws", use_fast=False)

MastafaF commented 3 years ago

Alright,

pip install sentencepiece 

and then restarting the runtime seems to do the job.

Cheers,
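
For anyone hitting the same error: the message itself names sentencepiece as the missing piece, so it can help to check for it before loading the tokenizer. Below is a minimal sketch of such a check using only the standard library. The helper name missing_packages is hypothetical (not part of transformers), and it only reports whether the package can be found for import; it does not install anything.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names from `names` that cannot be found for import.

    Hypothetical helper for illustration; it is not part of transformers.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


# The error message above says sentencepiece is required for the T5 tokenizer.
missing = missing_packages(["sentencepiece"])
if missing:
    print("Missing packages:", ", ".join(missing))
    print("Install them with pip, then restart the runtime before importing transformers.")
else:
    print("sentencepiece is available.")
```

In a Colab notebook the install command is usually run in a cell as !pip install sentencepiece, and, as noted above, the runtime needs a restart afterwards.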