Closed TheSeamau5 closed 1 year ago
Hi @TheSeamau5 ,
How long does it hang? Without a warm-up phase, model.generate
takes several minutes to finish.
I've tried this model on my side on an A10: after a warm-up phase of ~5 minutes, subsequent calls take less than a second.
Hi @jonathlela, thank you for your swift response.
I didn't realize that the warm-up phase took multiple minutes. I guess I don't really understand how it works.
So, I tried it again on an A10 and got 443.72 s (7 min 24 s) for the first call of
model_output = model.generate(
    **inputs,
    min_length=22, max_length=22
)
I noticed several times that it is inconsistent: subsequent calls sometimes take 7 min and sometimes 0.1 s.
Critically, I have not managed to run it with max_length=512
within a reasonable time frame (< 30 min) even once, which at the end of the day is what I'm trying to do.
OK, so I rewrote the script to first pass a warm-up prompt to the model and then pass other, different prompts to it.
It looks like re-running the model on the warm-up prompt is fast, but running it on a new prompt is slow.
Is this how the library is supposed to be used? I was trying to follow this: https://github.com/ELS-RD/kernl/blob/main/tutorial/t5%20e2e.ipynb
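One way to test the "warm-up prompt is fast, new prompt is slow" behavior is to compare token counts: if a new prompt's length was never seen during warm-up, the graphs have to be re-captured for the new shape. A rough sketch; the whitespace split is only a stand-in, with the real tokenizer you would pass `lambda p: len(tokenizer(p)["input_ids"])`:

```python
def seen_shapes_report(warmup_prompts, new_prompts, count_tokens=lambda p: len(p.split())):
    """Flag prompts whose token count was never seen during warm-up.

    `count_tokens` is a stand-in (whitespace split); with a HF tokenizer
    you would pass: lambda p: len(tokenizer(p)["input_ids"]).
    Returns (prompt, token_count, shape_was_warmed) tuples.
    """
    warmed = {count_tokens(p) for p in warmup_prompts}
    return [(p, count_tokens(p), count_tokens(p) in warmed) for p in new_prompts]
```

Any prompt flagged `False` would be a candidate explanation for a slow call.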
The solution I'm looking for is one where I can spend any amount of time once, at "build time", running all sorts of warm-up phases and optimizations, but after that the model should be fast(er) and will only see new inputs.
Thank you for your help, and I apologize if it seems like I didn't understand how the library works; I am new to optimizing PyTorch models.
# Standard Library imports
import os
import sys
import time
# Third-party Library imports
from rich import print
from rich.markdown import Markdown
from rich.syntax import Syntax
import torch
import torch._dynamo as torchdynamo
from transformers import AutoTokenizer, AutoModelForCausalLM, T5ForConditionalGeneration
from kernl.model_optimization import optimize_model
from tqdm import tqdm
# Model Name
MODEL_NAME = "Salesforce/codet5-large-ntp-py"
# Device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Download tokenizer from HuggingFace
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Download raw model from HuggingFace
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME).eval().cuda()
# default cache size needs to be increased to store the many graphs with generative models
torchdynamo.config.cache_size_limit = 512
# Optimize the model with kernl
optimize_model(model.encoder)
optimize_model(model.decoder)
# Function to generate a completion
def lm(prompt: str, **kwargs) -> str:
    # Tokenize the input
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        # pad_to_multiple_of=8,
        # padding=True
    ).to(device)
    # Compute the generation
    with torch.inference_mode(), torch.cuda.amp.autocast():
        torch.cuda.synchronize()
        model_output = model.generate(
            **inputs,
            **kwargs
        )
        torch.cuda.synchronize()
    # Decode the output
    completion = tokenizer.decode(model_output[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
    # Return the output
    return completion
# This is probably a hack
os.environ["TOKENIZERS_PARALLELISM"] = "true"
prompt1 = """
# List all currently running unencrypted EC2 instances
# An unencrypted EC2 instance is an instance with at least one unencrypted EBS volume
# Step 1: List all unencrypted EBS volumes and get the list of attached EC2 instances
# Step 2: Return the set of unique EC2 instances with at least one unencrypted EBS volume
import boto3
""".strip()
prompt2 = """
# Retrieve list of buckets from S3
import boto3
""".strip()
warmup_prompt = """
# Retrieve list of instances from EC2
import boto3
"""
# Prompts we will test with
prompts = [
    prompt1,
    prompt2,
]
#############
# MAIN LOOP #
#############
# Main warm-up phase
# warmup (in practice, the encoder and decoder should each be warmed up on their own)
print(Markdown("# Warmup Phase"))
start = time.perf_counter()
lm(warmup_prompt, min_length=22, max_length=22)
print(f" - Warmup completed in: {time.perf_counter() - start}")
# Second warm-up
print(Markdown("# Second warm-up"))
for _ in tqdm(range(10), desc="Warm-up runs"):
    lm(warmup_prompt, min_length=22, max_length=22)
# Actual Run
print(Markdown("# Generations"))
for prompt in tqdm(prompts, desc="Generations"):
    # Compute the completion
    completion = lm(
        prompt,
        # max_length=512
        min_length=22, max_length=22
    )
By the way, here is the output I'm getting (the completions themselves are correct):
(ubuntu-py3.9) ubuntu@152-70-121-233:~$ poetry run python kernl_repro2.py
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Warmup Phase ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
/home/ubuntu/.cache/pypoetry/virtualenvs/ubuntu-zk_aSFMD-py3.9/lib/python3.9/site-packages/torch/cuda/graphs.py:82: UserWarning: The CUDA Graph is empty. This ususally means that the graph was attempted to be captured on wrong device or stream. (Triggered internally at ../aten/src/ATen/cuda/CUDAGraph.cpp:191.)
super(CUDAGraph, self).capture_end()
- Warmup completed in: 404.04018353799984
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Second warm-up ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
Warm-up runs: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:01<00:00, 6.80it/s]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Generations ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
Generations: 0%| | 0/2 [00:00<?, ?it/s]/home/ubuntu/.cache/pypoetry/virtualenvs/ubuntu-zk_aSFMD-py3.9/lib/python3.9/site-packages/torch/cuda/graphs.py:82: UserWarning: The CUDA Graph is empty. This ususally means that the graph was attempted to be captured on wrong device or stream. (Triggered internally at ../aten/src/ATen/cuda/CUDAGraph.cpp:191.)
super(CUDAGraph, self).capture_end()
Generations: 50%|█████████████████████████████████████████████████████████████████████████████████▌ | 1/2 [05:24<05:24, 324.80s/it]/home/ubuntu/.cache/pypoetry/virtualenvs/ubuntu-zk_aSFMD-py3.9/lib/python3.9/site-packages/torch/cuda/graphs.py:82: UserWarning: The CUDA Graph is empty. This ususally means that the graph was attempted to be captured on wrong device or stream. (Triggered internally at ../aten/src/ATen/cuda/CUDAGraph.cpp:191.)
super(CUDAGraph, self).capture_end()
Generations: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [11:29<00:00, 344.52s/it]
Maybe your new prompts have a different size from the prompts in your warm-up phase. The warm-up phase should capture every shape of your actual input. For example, if your maximum prompt size is 32 tokens, the warm-up phase should include a prompt of 0-8 tokens, one of 9-16 tokens, one of 17-24 tokens, and one of 25-32 tokens.
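That bucketing scheme can be sketched in plain Python (the step size of 8 and the helper names here are illustrative assumptions, not part of kernl's API):

```python
def bucket(length: int, step: int = 8) -> int:
    """Round a token count up to the next multiple of `step`.

    Padding every prompt to its bucket size means the model only ever
    sees a small, fixed set of input shapes, so a warm-up phase can
    cover them all ahead of time.
    """
    return ((max(length, 1) + step - 1) // step) * step

def warmup_lengths(max_tokens: int, step: int = 8) -> list[int]:
    """All bucket sizes a warm-up phase would need to cover 1..max_tokens."""
    return list(range(step, bucket(max_tokens, step) + 1, step))
```

Padding real prompts up to their bucket size at inference time (e.g. with the tokenizer's `padding`/`pad_to_multiple_of=8` options, as in the commented-out lines of the script above) then keeps every runtime shape inside the warmed set.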
Feel free to reopen if needed.
Description
I tried to call
optimize_model
on the CodeT5 model: https://huggingface.co/Salesforce/codet5-large-ntp-py
Instead, the call to
model.generate
hangs.
Steps to reproduce
Code to reproduce
Expected Behavior
If you comment out the
optimize_model
lines, you get an answer, which is expected.
Actual Behavior
Error after keyboard interrupt
Your environment
Operating system and version: Ubuntu 20.04.5 LTS
Python version: 3.9.16
Python package manager: pip 23.0
! nvidia-smi
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+