Knowledgator / TurboT5

Truly flash T5 realization!

Why it changes pytorch version #5

Open · Oxi84 opened this issue 2 months ago

Oxi84 commented 2 months ago

Why does it change the PyTorch version and install a different CUDA on the system?

This would actually break most people's environments, because there can be only one CUDA version on Ubuntu, and it has to match the one in the Python environment.
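For reference, a quick way to see what the install actually left you with (plain PyTorch API, nothing TurboT5-specific):

```python
import torch

print(torch.__version__)          # PyTorch version currently installed
print(torch.version.cuda)         # CUDA version this PyTorch build was compiled against
print(torch.cuda.is_available())  # False usually means the build no longer matches the driver
```

If the PyPI package name is `turbot5`, `pip install turbot5 --no-deps` should skip the torch/triton pins, though you then have to satisfy the dependencies yourself.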

Oxi84 commented 2 months ago

Also, it is slower than the default; here is one example:

```python
from turbot5 import T5ForConditionalGeneration, T5Config
from transformers import T5Tokenizer
import torch
import time

# Initialize the tokenizer and model
tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-large")  # Use smaller model
model = T5ForConditionalGeneration.from_pretrained(
    "google-t5/t5-large",
    attention_type='flash',  # Specify attention type
    use_triton=True,
).to('cuda')

# GradScaler (only needed for training; unused during generation)
scaler = torch.cuda.amp.GradScaler()

# List of input sentences for translation
input_texts = [
    "translate English to German: How old are you?",
    "translate English to French: I am learning how to use transformers.",
    "translate English to Spanish: This is a test of T5 with Flash attention.",
    "translate English to Italian: The sky is clear today.",
    "translate English to Portuguese: I like to play soccer on weekends.",
]

# Tokenize the input sentences (process smaller batches if needed)
input_ids = tokenizer(
    input_texts, return_tensors="pt", padding=True, truncation=True
).input_ids.to('cuda')

# Function to measure execution time
def measure_time(func):
    start_time = time.time()
    result = func()
    end_time = time.time()
    return result, end_time - start_time

# Number of repetitions
num_repetitions = 5
total_time = 0.0

# Loop to repeat the translation process 5 times
for i in range(num_repetitions):
    with torch.cuda.amp.autocast():  # Enable mixed precision for memory efficiency
        outputs, exec_time = measure_time(lambda: model.generate(input_ids))
        total_time += exec_time

    # Decode and print the translated outputs
    translated_texts = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
    print(f"Iteration {i+1}:")
    for input_text, translated_text in zip(input_texts, translated_texts):
        print(f"Input: {input_text}")
        print(f"Translated Output: {translated_text}")

# Calculate and print the average execution time
average_time = total_time / num_repetitions
print(f"Average Execution Time: {average_time:.4f} seconds")
```
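
Side note on the benchmark itself: with `use_triton=True`, the first `generate()` call may include one-time Triton kernel compilation, and CUDA work is launched asynchronously, so wrapping `generate()` in `time.time()` alone can be misleading. A fairer variant (a sketch reusing `model`, `input_ids`, and `num_repetitions` from above) warms up first and synchronizes before reading the clock:

```python
# Warm-up pass: lets any one-time kernel compilation happen outside the timed region
_ = model.generate(input_ids)
torch.cuda.synchronize()

start = time.time()
for _ in range(num_repetitions):
    _ = model.generate(input_ids)
torch.cuda.synchronize()  # make sure all queued GPU work has finished
print(f"Average Execution Time: {(time.time() - start) / num_repetitions:.4f} seconds")
```

If the slowdown persists even with a warm-up pass excluded, that would make the comparison against the default attention much more convincing.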