Oxi84 opened this issue 2 months ago
Also, it is slower than the default implementation; here is one example:
```python
from turbot5 import T5ForConditionalGeneration, T5Config
from transformers import T5Tokenizer
import torch
import time

tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-large")
model = T5ForConditionalGeneration.from_pretrained(
    "google-t5/t5-large",
    attention_type='flash',  # Specify attention type
    use_triton=True,
).to('cuda')

scaler = torch.cuda.amp.GradScaler()  # Note: unused here, only needed for training

input_texts = [
    "translate English to German: How old are you?",
    "translate English to French: I am learning how to use transformers.",
    "translate English to Spanish: This is a test of T5 with Flash attention.",
    "translate English to Italian: The sky is clear today.",
    "translate English to Portuguese: I like to play soccer on weekends.",
]

input_ids = tokenizer(input_texts, return_tensors="pt", padding=True, truncation=True).input_ids.to('cuda')

def measure_time(func):
    start_time = time.time()
    result = func()
    end_time = time.time()
    return result, end_time - start_time

num_repetitions = 5
total_time = 0.0

for i in range(num_repetitions):
    with torch.cuda.amp.autocast():  # Enable mixed precision for memory efficiency
        outputs, exec_time = measure_time(lambda: model.generate(input_ids))
        total_time += exec_time

    # Decode and print the translated outputs
    translated_texts = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
    print(f"Iteration {i+1}:")
    for input_text, translated_text in zip(input_texts, translated_texts):
        print(f"Input: {input_text}")
        print(f"Translated Output: {translated_text}")

average_time = total_time / num_repetitions
print(f"Average Execution Time: {average_time:.4f} seconds")
```
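For comparison, the "default" I am benchmarking against is the stock `transformers` T5. A minimal timing sketch for that baseline (same model name and a subset of the inputs above; the warm-up and `torch.cuda.synchronize()` calls are my additions to keep GPU timing honest) would be:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch
import time

tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-large")
baseline = T5ForConditionalGeneration.from_pretrained("google-t5/t5-large").to('cuda')

input_texts = ["translate English to German: How old are you?"]
input_ids = tokenizer(input_texts, return_tensors="pt", padding=True).input_ids.to('cuda')

# Warm-up run so one-time CUDA initialization does not skew the measurement
baseline.generate(input_ids)
torch.cuda.synchronize()

num_repetitions = 5
start = time.time()
for _ in range(num_repetitions):
    baseline.generate(input_ids)
torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock

print(f"Average Execution Time: {(time.time() - start) / num_repetitions:.4f} seconds")
```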
Why does it change the PyTorch version and install a different CUDA on the system?
This would actually break most people's environments, because there can only be one CUDA toolkit version on Ubuntu, and it has to match the one the environment's PyTorch build expects.
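For anyone hitting this, a quick way to check whether the installed PyTorch build has diverged from the system toolkit (standard `torch` attributes, nothing turbot5-specific):

```python
import torch

print(torch.__version__)          # PyTorch build version
print(torch.version.cuda)         # CUDA version this PyTorch build was compiled against
print(torch.cuda.is_available())  # False often indicates a driver/toolkit mismatch
```

Compare `torch.version.cuda` against the host's `nvcc --version` and the driver version reported by `nvidia-smi` to confirm the mismatch.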