OpenNMT / CTranslate2

Fast inference engine for Transformer models
https://opennmt.net/CTranslate2
MIT License

Other differences in the beam search implementation? #1740

Open koren-v opened 1 month ago

koren-v commented 1 month ago

Just to make sure this issue won't be missed, I will duplicate my response here:

I faced the same issue. However, after reading this response, I thought that if I didn't use any special generation parameters (like no_repeat_ngram_size), I would get the same result. Unfortunately, it seems that there are other differences in the beam search implementation - or am I missing something?

To reproduce:

package versions: transformers==4.34.0, ctranslate2==3.20.0 (as used here)

  1. convert the model:

    ct2-transformers-converter --model "google/flan-t5-base" --output_dir "ct2-t5-base"

  2. code snippet:

    import torch
    import ctranslate2
    from transformers import T5ForConditionalGeneration, AutoTokenizer

    device = torch.device("cuda")

    model_name = "google/flan-t5-base"
    hf_model = T5ForConditionalGeneration.from_pretrained(model_name).eval().to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    fast_model = ctranslate2.Translator("ct2-t5-base", device="cuda")

    text = "translate English to German: physician assistants are medical providers who are licensed to diagnose and treat illness and disease and to prescribe medication"

    def get_out(inp, model):
        inputs = tokenizer(inp, return_tensors="pt")
        ids = model.generate(**inputs.to(device), num_beams=3, min_length=0, max_length=1024)
        return tokenizer.batch_decode(ids, skip_special_tokens=True)[0]

    def get_out_fast(inp, model):
        source = tokenizer.encode(inp)
        source = tokenizer.convert_ids_to_tokens(source)
        results = model.translate_batch([source], beam_size=3, min_decoding_length=0, max_decoding_length=1024)
        target = results[0].hypotheses[0]
        return tokenizer.decode(tokenizer.convert_tokens_to_ids(target), skip_special_tokens=True)

    res_vanilla = get_out(text, hf_model)
    res_fast = get_out_fast(text, fast_model)

    print("Vanilla output:", res_vanilla)
    print("Ctranslate output:", res_fast)


Output:

Vanilla output: physician assistants sind medical providers, die zu Diagnose und Behandlung von Krankheiten und Krankheiten und zu Verknüpfen von Medikamenten zu ermitteln.
Ctranslate output: physician assistants sind medical providers, die zu Diagnose und Behandlung von Krankheiten und Krankheiten und zu Verknüpfen von Medikamenten zu kaufen sind.
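
One way to narrow down where the divergence comes from could be to compare the scores each framework assigns to its best hypothesis. This is just a debugging sketch I would try, reusing the variables from the snippet above; note that the two scores are not necessarily on the same scale, since the frameworks may normalize by length differently, so rankings/trends matter more than the raw numbers.

# Debugging sketch (not part of the original repro): compare best-hypothesis scores.
# transformers exposes sequences_scores for beam search when output_scores=True;
# CTranslate2 exposes per-hypothesis scores when return_scores=True.
inputs = tokenizer(text, return_tensors="pt").to(device)
hf_out = hf_model.generate(**inputs, num_beams=3, min_length=0, max_length=1024,
                           return_dict_in_generate=True, output_scores=True)
print("HF best beam score:", hf_out.sequences_scores[0].item())

source = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
ct2_out = fast_model.translate_batch([source], beam_size=3, min_decoding_length=0,
                                     max_decoding_length=1024, return_scores=True)
print("CT2 best hypothesis score:", ct2_out[0].scores[0])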

vince62s commented 1 month ago

what if you run the same with beam=1?

koren-v commented 1 month ago

The output is the same:

Vanilla output: physician assistants sind medizinische Versorgungsträger, die ärztlichen Versorgungskräfte benötigen, um Krankheiten und Krankheiten zu diagnostischen und zu behandeln und zu prescriben Medikamenten.
Ctranslate output: physician assistants sind medizinische Versorgungsträger, die ärztlichen Versorgungskräfte benötigen, um Krankheiten und Krankheiten zu diagnostischen und zu behandeln und zu prescriben Medikamenten.
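
(For completeness, the only change from the first snippet for this run was the beam width, i.e. greedy decoding on both sides; a minimal sketch reusing the same variables:)

# Greedy decoding: num_beams=1 in transformers, beam_size=1 in CTranslate2.
inputs = tokenizer(text, return_tensors="pt").to(device)
ids = hf_model.generate(**inputs, num_beams=1, min_length=0, max_length=1024)

source = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
results = fast_model.translate_batch([source], beam_size=1, min_decoding_length=0, max_decoding_length=1024)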
vince62s commented 1 month ago

If you have time, maybe you can test CT2 2.24 (before these changes: https://github.com/OpenNMT/CTranslate2/blob/39f48f2e843df52245e6c857326e1115bca12b03/CHANGELOG.md?plain=1#L551-L552) and test with/without allow_early_exit and length_penalty.

koren-v commented 1 month ago

Ok, I needed to use another model since T5 was not supported in ctranslate2==2.24.0. Here are the results I got in my experiments:

Convert model:

ct2-transformers-converter --model "beogradjanka/bart_finetuned_keyphrase_extraction" --output_dir "ct2-bart"

New code snippet:

from itertools import product

import torch

from transformers import BartForConditionalGeneration, AutoTokenizer
import ctranslate2

device = torch.device("cuda")

model_name = "beogradjanka/bart_finetuned_keyphrase_extraction"
hf_model = BartForConditionalGeneration.from_pretrained(model_name).eval().to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

fast_model = ctranslate2.Translator("ct2-bart", device="cuda")

text = (
    "The core CTranslate2 implementation is framework agnostic. The logic that is specific to each framework is moved "
    "to a conversion step that loads supported models into a unified representation. The weights are then optionally "
    "quantized and saved into an optimized binary format."
)

def get_out(inp, model, length_penalty=1.0, allow_early_exit=None):
    inputs = tokenizer(inp, return_tensors="pt")
    ids = model.generate(**inputs.to(device),
                         num_beams=5,
                         min_length=0,
                         max_length=1024,
                         length_penalty=length_penalty
                         )
    return tokenizer.batch_decode(ids, skip_special_tokens=True)[0]

def get_out_fast(inp, model, length_penalty=1.0, allow_early_exit=False):
    source = tokenizer.encode(inp)
    source = tokenizer.convert_ids_to_tokens(source)
    results = model.translate_batch([source],
                                    beam_size=5,
                                    min_decoding_length=0,
                                    max_decoding_length=1024,
                                    length_penalty=length_penalty,
                                    allow_early_exit=allow_early_exit,
                                    )
    target = results[0].hypotheses[0]
    return tokenizer.decode(tokenizer.convert_tokens_to_ids(target), skip_special_tokens=True)

for lp, aes in product([1.0, 3.0], [False, True]):
    res_vanilla = get_out(text, hf_model, lp, aes)
    res_fast = get_out_fast(text, fast_model, lp, aes)

    print("Predictions are equal:", res_vanilla == res_fast, f", when length_penalty={lp} and allow_early_exit={aes}")
    print("Vanilla output:", res_vanilla)
    print("Ctranslate output:", res_fast)
    print("========================================")

Output:

Predictions are equal: False , when length_penalty=1.0 and allow_early_exit=False
Vanilla output: ctranslate2, framework agnostic, platform agnostic
Ctranslate output: ctranslate2, framework agnostic, framework agnostic
========================================
Predictions are equal: False , when length_penalty=1.0 and allow_early_exit=True
Vanilla output: ctranslate2, framework agnostic, platform agnostic
Ctranslate output: ctranslate2, framework agnostic, framework agnostic
========================================
Predictions are equal: False , when length_penalty=3.0 and allow_early_exit=False
Vanilla output: ctranslate2, framework agnostic, model validation, model conversion
Ctranslate output: ctranslate2, framework agnostic, framework agnostic, model conversion
========================================
Predictions are equal: False , when length_penalty=3.0 and allow_early_exit=True
Vanilla output: ctranslate2, framework agnostic, model validation, model conversion
Ctranslate output: ctranslate2, ctranslate2, framework agnostic, platform agnostic, framework agnostic
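
(A side note on my snippet above: get_out accepts allow_early_exit but never forwards it to generate, so the transformers output cannot react to that flag and only the CTranslate2 side actually varies with it. If one wanted something comparable on the transformers side, the closest analogue I know of is generate's early_stopping argument; a hedged sketch, assuming the variables inside get_out and assuming that mapping is even meaningful:)

# early_stopping is a real generate() argument for beam search; treating it as the
# counterpart of CTranslate2's allow_early_exit is an assumption, not verified here.
ids = model.generate(**inputs.to(device), num_beams=5, min_length=0, max_length=1024,
                     length_penalty=length_penalty, early_stopping=allow_early_exit)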

Correct me if I misunderstood what you meant by "with/without allow_early_exit and length_penalty".

minhthuc2502 commented 1 month ago

As Guillaume mentioned before, there are often subtle differences in the way beam search is implemented between frameworks. It could make a slight difference. In my opinion, it looks good in 2 of the cases.
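
For what it's worth, one concrete example of such a subtle difference is length normalization: two schemes that both appear in the literature (plain length normalization and the GNMT-style penalty) can rank the same pair of finished hypotheses differently, which alone is enough to change which translation a beam search returns. A small illustration with invented numbers (not a claim about the exact formulas used by transformers or CTranslate2):

# Two finished hypotheses as (cumulative log-probability, length in tokens).
# The numbers are made up purely for illustration.
hyps = [(-4.0, 5), (-7.0, 10)]

def simple_norm(logprob, length, alpha=1.0):
    # plain length normalization: divide by length**alpha
    return logprob / (length ** alpha)

def gnmt_norm(logprob, length, alpha=1.0):
    # GNMT-style penalty: divide by ((5 + length) / 6) ** alpha
    return logprob / (((5 + length) / 6) ** alpha)

print("plain normalization picks:", max(hyps, key=lambda h: simple_norm(*h)))  # (-7.0, 10)
print("GNMT-style penalty picks: ", max(hyps, key=lambda h: gnmt_norm(*h)))    # (-4.0, 5)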