OpenNMT / CTranslate2

Fast inference engine for Transformer models
https://opennmt.net/CTranslate2
MIT License

lmsys/fastchat-t5-3b-v1.0: inconsistent generated output with converted model #1220

Closed: Matthieu-Tinycoaching closed this issue 1 year ago

Matthieu-Tinycoaching commented 1 year ago

Hi,

I tried to convert and use the lmsys/fastchat-t5-3b-v1.0 model, which is an open-source chatbot trained by fine-tuning Flan-t5-xl (3B parameters) on user-shared conversations collected from ShareGPT.

I converted the model with both the default settings and int8 quantization, without any error message.
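For reference, the conversion can be done from Python with the Transformers converter API; a minimal sketch (the output paths here are illustrative):

import ctranslate2

converter = ctranslate2.converters.TransformersConverter("lmsys/fastchat-t5-3b-v1.0")
converter.convert("fastchat-t5-3b-v1.0/default")                    # default precision
converter.convert("fastchat-t5-3b-v1.0/int8", quantization="int8")  # int8 quantization

But when trying to use the converted models for generation: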

model_id = "lmsys/fastchat-t5-3b-v1.0"
beam_size=1
model_path = "/home/matthieu/Deployment/CTranslate2/fastchat-t5-3b-v1.0/default"

context = "Des observations de 2015 par la sonde Dawn ont confirmé qu'elle possède une forme sphérique, à la différence des corps plus petits qui ont une forme irrégulière. Sa surface est probablement composée d'un mélange de glace d'eau et de divers minéraux hydratés (notamment des carbonates et de l'argile), et de la matière organique a été décelée. Il semble que Cérès possède un noyau rocheux et un manteau de glace. Elle pourrait héberger un océan d'eau liquide, ce qui en fait une piste pour la recherche de vie extraterrestre. Cérès est entourée d'une atmosphère ténue contenant de la vapeur d'eau, dont deux geysers, ce qui a été confirmé le 22 janvier 2014 par l'observatoire spatial Herschel de l'Agence spatiale européenne."

question = "Quelle caractéristique possède Cérès qui rendrait la vie extraterrestre possible ?"

input_text = [f"Given the context please answer the question. Use only French words. Context: {context}; Question: {question}; Answer:"]

translator = ctranslate2.Translator(model_path, device="cuda")
tokenizer = T5Tokenizer.from_pretrained(model_id)

input_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(input_text[0]))

results = translator.translate_batch(source=[input_tokens], beam_size=beam_size)
output_tokens = results[0].hypotheses[0]
output_text = tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens), spaces_between_special_tokens=False)

print(output_text)

I got the following inconsistent generated output, which is nothing but a long unbroken run of periods:

................................................................................

While with the Hugging Face model I got: ["Cérès pourrait héberger un océan d'eau liquide.\n"]

Any advice on this error?

aamir-s18 commented 1 year ago

I tried the same thing. CTranslate2 works without any issues for this T5 model: https://huggingface.co/declare-lab/flan-sharegpt-xl. But it breaks for fastchat's model.

Matthieu-Tinycoaching commented 1 year ago

@guillaumekln any feedback regarding this model?

Best

guillaumekln commented 1 year ago

The tokenizer from the fine-tuned model looks broken to me. See the tokenization difference with the base tokenizer and the different padding token:

>>> import transformers
>>> tokenizer = transformers.T5Tokenizer.from_pretrained("google/flan-t5-xl")
>>> tokenizer.convert_ids_to_tokens(tokenizer.encode("Quelle caractéristique possède Cérès qui rendrait la vie extraterrestre possible ?"))
['▁Quelle', '▁caractéris', 'tique', '▁possède', '▁C', 'é', 'r', 'ès', '▁qui', '▁rend', 'rait', '▁la', '▁vie', '▁extra', 'ter', 'rest', 're', '▁possible', '▁', '?', '</s>']
>>> tokenizer.pad_token
'<pad>'

>>> tokenizer = transformers.T5Tokenizer.from_pretrained("lmsys/fastchat-t5-3b-v1.0")
>>> tokenizer.convert_ids_to_tokens(tokenizer.encode("Quelle caractéristique possède Cérès qui rendrait la vie extraterrestre possible ?"))
['▁Quelle', ' ', '▁caractéris', 'tique', ' ', '▁possède', ' ', '▁C', 'é', 'r', 'ès', ' ', '▁qui', ' ', '▁rend', 'rait', ' ', '▁la', ' ', '▁vie', ' ', '▁extra', 'ter', 'rest', 're', ' ', '▁possible', ' ', '▁', '?', '</s>']
>>> tokenizer.pad_token
'[PAD]'

I get the expected output after making these changes.
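A minimal sketch of the presumed fix, i.e. tokenizing and detokenizing with the base google/flan-t5-xl tokenizer end to end (model_path and input_text as defined in the first comment):

import ctranslate2
from transformers import T5Tokenizer

# Use the base Flan-T5 tokenizer, whose padding token is "<pad>".
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
translator = ctranslate2.Translator(model_path, device="cuda")

input_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(input_text[0]))
results = translator.translate_batch(source=[input_tokens], beam_size=1)
output_text = tokenizer.decode(
    tokenizer.convert_tokens_to_ids(results[0].hypotheses[0]),
    spaces_between_special_tokens=False,
)
print(output_text)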

Matthieu-Tinycoaching commented 1 year ago

Hi @guillaumekln thanks for your feedback!

I would have 2 questions regarding generation options:

1. How do sampling_topk and sampling_temperature affect the generated output?
2. How can I specify in the prompt that the model should leave the answer blank if it cannot be found in the context?

guillaumekln commented 1 year ago

1. sampling_topk enables random sampling. You may want to read this document: https://huggingface.co/blog/how-to-generate
2. I don't know. It depends on the model and whether it was trained to have this behavior. You should ask the authors of the model.
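For illustration, a sketch of how these options are passed to translate_batch (reusing translator and input_tokens from the first comment; the values are arbitrary examples, not recommendations):

# Deterministic decoding: greedy search (beam_size=1) or beam search (beam_size>1).
results = translator.translate_batch(source=[input_tokens], beam_size=4)

# Random sampling: pick from the 10 most likely tokens at each step;
# sampling_temperature > 1 flattens the distribution, < 1 sharpens it.
results = translator.translate_batch(
    source=[input_tokens],
    beam_size=1,
    sampling_topk=10,
    sampling_temperature=0.7,
)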

Matthieu-Tinycoaching commented 1 year ago

Thanks for the tips!

NeonBohdan commented 1 year ago

It works, thanks. [PAD] -> <pad>

filipemesquita commented 1 year ago

The tokenizer from the fine-tuned model looks broken to me. See the tokenization difference with the base tokenizer and the different padding token:

@guillaumekln This output from the fastchat-t5-3b tokenizer is expected. The fastchat tokenizer explicitly encodes whitespace as a workaround for Flan T5's inability to represent multiple whitespaces. The fastchat tokenizer also adds tokens for linebreaks (\n) and other characters that are ignored by Flan T5's default tokenizer. See: https://github.com/lm-sys/FastChat/issues/1022#issuecomment-1540666091

So using the Flan T5 tokenizer doesn't actually fully solve the problem, since the fastchat model then no longer recognizes multiple whitespaces, linebreaks, and other characters.
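For what it's worth, the difference is easy to see on text containing linebreaks or repeated spaces; a small check (both tokenizers fetched from the Hub):

import transformers

base = transformers.T5Tokenizer.from_pretrained("google/flan-t5-xl")
fastchat = transformers.T5Tokenizer.from_pretrained("lmsys/fastchat-t5-3b-v1.0")

text = "line one\nline two  with  extra  spaces"
# The base tokenizer collapses the linebreak and the repeated spaces...
print(base.convert_ids_to_tokens(base.encode(text)))
# ...while the fastchat tokenizer keeps them as explicit tokens.
print(fastchat.convert_ids_to_tokens(fastchat.encode(text)))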

It would be great if the fastchat model could be fully supported by ctranslate2.

filipemesquita commented 1 year ago

Changing [PAD] to <pad> in config.json seems enough to fix the converted model. It works with the fastchat-t5-3b tokenizer (no need to use the flan-t5-xl tokenizer).
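A minimal sketch of that fix, assuming it targets the config.json written into the CTranslate2 output directory (the path is illustrative):

import json

config_path = "fastchat-t5-3b-v1.0/default/config.json"

with open(config_path) as f:
    config = json.load(f)

# Replace the broken "[PAD]" token recorded by the converter with T5's "<pad>".
config = {key: "<pad>" if value == "[PAD]" else value for key, value in config.items()}

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)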

Don't forget to use the following parameters when decoding:

text_output = tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens), spaces_between_special_tokens=False, skip_special_tokens=True)

Matthieu-Tinycoaching commented 1 year ago

Hi @filipemesquita, it seems however that the fastchat-t5-3b tokenizer isn't a fast tokenizer like the flan-t5-xl one. This could decrease inference performance.

vasileermicioi commented 1 year ago

Hi @filipemesquita, it seems however that the fastchat-t5-3b tokenizer isn't a fast tokenizer like the flan-t5-xl one. This could decrease inference performance.

Tokenization happens before and after the inference, and it is about 1000x faster than the inference itself, so even if there is a decrease, it would only be around 0.1%.

filipemesquita commented 1 year ago

I agree that tokenization performance is not a significant portion of the overall inference time. I think the main negative impact of using the tokenizer from fastchat-t5-3b is that it generates tokens for whitespace, which reduces the capacity left for useful tokens in the context (input tokens).

But in my experiments, the quality of the output is affected by using the tokenizer from flan-t5-xl. So if you are looking for quality similar to the model at https://chat.lmsys.org/, you probably want to use the tokenizer created specifically for fastchat-t5-3b.