Osilly opened this issue 2 months ago
I'm also wondering how the TPS is calculated; can you provide a more detailed description of it? When I evaluate the official LLaVA-1.5 model on a single A100 40GB GPU, the average TPS (generated_token_number / generated_time) I get is about 29, while in the paper it is 4.9. Is there something wrong?
@Osilly @AAbathur Hello, we use the code below to compute tokens per second (TPS):
```python
import time

import torch
from tqdm import tqdm

total_time = 0
total_tokens = 0
for index, item in tqdm(questions.iterrows(), ...):
    ...
    start = time.time()
    with torch.inference_mode():
        output_ids = model.generate(input_ids, images, ..., use_cache=True)
    ...
    end = time.time()
    ...
    # Count only the newly generated tokens, excluding the prompt.
    tokens = output_ids.shape[1] - input_token_len
    total_tokens += tokens
    total_time += end - start

token_per_second = total_tokens / total_time
```
The KV cache is used, with a single A100 (80GB) GPU.
@LiWentomng Thanks for your response! I would like to know why there is such a significant gap in inference time in Figure 4 between LLaVA-TokenPacker and the official LLaVA-1.5. In our experiments, image token reduction usually only accelerates the prefill stage (first-token generation) and has almost no impact on the subsequent generation steps when the KV cache is used (most of the overhead is in the linear layers). This issue is also discussed in pkunlp-icler/FastV#22. Can you provide more details?
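For reference, here is a minimal sketch of how the prefill and decode stages could be timed separately with the KV cache enabled. It is not from the repo's evaluation script; it assumes a LLaVA-style `model.generate` that accepts `images` and `max_new_tokens`, and reuses `model`, `input_ids`, and `images` from the snippet above (all variable names are illustrative):

```python
import time

import torch

with torch.inference_mode():
    # Approximate the prefill (first-token) latency by generating a single token.
    # This step processes all image and text tokens, so it is the part that
    # benefits from image token reduction.
    start = time.time()
    model.generate(input_ids, images=images, max_new_tokens=1, use_cache=True)
    prefill_time = time.time() - start

    # Full generation with the KV cache. After prefill, each decode step only
    # processes one new token, so its cost is largely independent of how many
    # image tokens were in the prompt. max_new_tokens=128 is an arbitrary choice.
    start = time.time()
    model.generate(input_ids, images=images, max_new_tokens=128, use_cache=True)
    total_time = time.time() - start

decode_time = total_time - prefill_time
```

Under the argument above, token reduction should shrink `prefill_time` noticeably while leaving `decode_time` almost unchanged.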
Hello, I would like to know: are the inference times reported in Figure 4 measured without the KV cache? And are the "TPS" results in Table 3 prefill times (first-token inference time)?
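As an illustration of the two settings being asked about (not code from the paper), the cached and uncached cases could be compared along these lines, assuming the same `model`, `input_ids`, `images`, and `input_token_len` as in the snippet above; `max_new_tokens=128` is an arbitrary choice:

```python
import time

import torch

def timed_generate(use_cache):
    # Time one full generate() call. Without the KV cache, every decode step
    # re-processes the whole sequence (including all image tokens), so image
    # token reduction then speeds up decoding as well, not just prefill.
    start = time.time()
    with torch.inference_mode():
        output_ids = model.generate(input_ids, images=images,
                                    max_new_tokens=128, use_cache=use_cache)
    return time.time() - start, output_ids.shape[1] - input_token_len

time_with_cache, n_tokens = timed_generate(use_cache=True)
# For simplicity, the token count from the cached run is reused below.
time_no_cache, _ = timed_generate(use_cache=False)

print(f"TPS with KV cache:    {n_tokens / time_with_cache:.2f}")
print(f"TPS without KV cache: {n_tokens / time_no_cache:.2f}")
```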