Osilly opened this issue 2 months ago
I'm also wondering how the TPS is calculated; can you provide a more detailed description of it? When I evaluate the official LLaVA-1.5 model on a single A100 40GB GPU, the average TPS (generated_token_number / generated_time) I get is about 29, while in the paper it is 4.9. Is there something wrong?
@Osilly @AAbathur Hello, we use the code below to compute tokens per second (TPS):
```python
import time

import torch
from tqdm import tqdm

total_time = 0
total_tokens = 0
for index, item in tqdm(questions.iterrows(), ...):
    ...
    start = time.time()
    with torch.inference_mode():
        output_ids = model.generate(input_ids, images, ..., use_cache=True)
    ...
    end = time.time()
    ...
    # Count only the newly generated tokens, excluding the prompt.
    tokens = output_ids.shape[1] - input_token_len
    total_tokens += tokens
    total_time += end - start

token_per_second = total_tokens / total_time
```
The KV cache is used, with a single A100 (80GB) GPU.
@LiWentomng Thanks for your response! I would like to know why there is such a significant gap in inference time in Figure 4 between LLaVA-TokenPacker and the official LLaVA-1.5. In our experiments, image token reduction usually only accelerates the prefill stage (first-token generation) and has almost no impact on the subsequent generation steps when the KV cache is used (most of the overhead is in the linear layers). This issue is also discussed in pkunlp-icler/FastV#22. Can you provide more details?
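For reference, here is a minimal sketch of how the prefill and decode stages could be timed separately with the KV cache enabled. It is not from the repo's evaluation script; it assumes a LLaVA-style `model.generate` that accepts `images` and `max_new_tokens`, and reuses `model`, `input_ids`, and `images` from the snippet above (all variable names are illustrative):

```python
import time

import torch

with torch.inference_mode():
    # Approximate the prefill (first-token) latency by generating a single token.
    # This step processes all image and text tokens, so it is the part that
    # benefits from image token reduction.
    start = time.time()
    model.generate(input_ids, images=images, max_new_tokens=1, use_cache=True)
    prefill_time = time.time() - start

    # Full generation with the KV cache. After prefill, each decode step only
    # processes one new token, so its cost is largely independent of how many
    # image tokens were in the prompt. max_new_tokens=128 is an arbitrary choice.
    start = time.time()
    model.generate(input_ids, images=images, max_new_tokens=128, use_cache=True)
    total_time = time.time() - start

decode_time = total_time - prefill_time
```

Under the argument above, token reduction should shrink `prefill_time` noticeably while leaving `decode_time` almost unchanged.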
Hello, I would like to know: are the inference times reported in Figure 4 measured without the KV cache? And are the "TPS" results in Table 3 prefill times (first-token inference time)?
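As an illustration of the two settings being asked about (not code from the paper), the cached and uncached cases could be compared along these lines, assuming the same `model`, `input_ids`, `images`, and `input_token_len` as in the snippet above; `max_new_tokens=128` is an arbitrary choice:

```python
import time

import torch

def timed_generate(use_cache):
    # Time one full generate() call. Without the KV cache, every decode step
    # re-processes the whole sequence (including all image tokens), so image
    # token reduction then speeds up decoding as well, not just prefill.
    start = time.time()
    with torch.inference_mode():
        output_ids = model.generate(input_ids, images=images,
                                    max_new_tokens=128, use_cache=use_cache)
    return time.time() - start, output_ids.shape[1] - input_token_len

time_with_cache, n_tokens = timed_generate(use_cache=True)
# For simplicity, the token count from the cached run is reused below.
time_no_cache, _ = timed_generate(use_cache=False)

print(f"TPS with KV cache:    {n_tokens / time_with_cache:.2f}")
print(f"TPS without KV cache: {n_tokens / time_no_cache:.2f}")
```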