QwenLM / Qwen2-VL

Qwen2-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
Apache License 2.0

Concurrent inference error (GPU memory is sufficient): probability tensor contains either `inf`, `nan` or element < 0 #184

Open sky505 opened 1 week ago

sky505 commented 1 week ago

Issuing 5 concurrent inference requests on the same video consistently fails with: probability tensor contains either inf, nan or element < 0. At first I assumed it was a lack of compute, but when I switch to 5 different videos the inference runs normally.

The 5 requests:

  1. video A + image A + question 1
  2. video A + image B + question 2
  3. video A + image C + question 3
  4. video A + image D + question 4
  5. video A + image E + question 5

(screenshot: WechatIMG253)
jklj077 commented 1 week ago

Hi, this could be caused by many things. To better understand your situation, we need more information:

  1. Which framework did you use? Did you enable flash attention?
  2. Can each of the concurrent requests run individually without error?
  3. How did you implement concurrent inference?
sky505 commented 1 week ago

> Hi, this could be caused by many things. To better understand your situation, we need more information:
>
>   1. Which framework did you use? Did you enable flash attention?
>   2. Can each of the concurrent requests run individually without error?
>   3. How did you implement concurrent inference?

```python
import torch

try:
    # Preparation for inference
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to("cuda")
except Exception as e:
    print(f"input error --> {e}")
    raise

# Inference
try:
    generated_ids = model.generate(**inputs, max_new_tokens=512)
    generated_ids_trimmed = [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    print(f"Qwen model output: {output_text}")
except Exception as e:
    print(f"inference error --> {e}")
    raise

out = ""
if isinstance(output_text, list) and len(output_text) != 0:
    out = output_text[0]

# Free unused cached GPU memory
torch.cuda.empty_cache()
```
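If this wrapper is called from several threads at once, all of them enter the same shared model instance concurrently, which is one plausible way to end up with corrupted logits (and hence the inf/nan probability error). A minimal sketch, with hypothetical names and no Qwen2-VL API, of serializing access to a shared, non-thread-safe resource with a `threading.Lock`:

```python
import threading

# Hypothetical stand-in for a shared model: five "handler" threads share one
# non-thread-safe resource, as five Flask requests would share one model.
# The lock serializes access so no call observes the resource mid-use.
_lock = threading.Lock()
_in_use = False            # stands in for the model's internal mutable state
overlaps = []              # records any request that entered while busy
results = {}

def guarded_generate(request_id):
    global _in_use
    with _lock:            # without this lock, _in_use can race
        if _in_use:
            overlaps.append(request_id)
        _in_use = True
        results[request_id] = f"output-for-{request_id}"
        _in_use = False

threads = [threading.Thread(target=guarded_generate, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(overlaps)        # [] -- the lock prevented any overlap
print(len(results))    # 5
```

The trade-off is that requests are processed one at a time, but that matches what a single GPU with one model instance can do anyway.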

In the single-request case everything works; the error only appears under concurrent inference.

jklj077 commented 1 week ago

> In the single-request case everything works.

If so, it is unlikely that you're facing an issue with the model.

Still, to check potential coding problems:

  1. Which framework did you use? I see swift in the screenshot. If you're using swift rather than plain transformers, you may need to raise the issue with swift instead.
  2. How did you implement concurrent inference? The code you have shown handles a single request. I assume you did not implement dynamic batching and are using multi-threading instead?
sky505 commented 1 week ago

> > In the single-request case everything works.
>
> If so, it is unlikely that you're facing an issue with the model.
>
> Still, to check potential coding problems:
>
>   1. Which framework did you use? I see swift in the screenshot. If you're using swift rather than plain transformers, you may need to raise the issue with swift instead.
>   2. How did you implement concurrent inference? The code you have shown handles a single request. I assume you did not implement dynamic batching and are using multi-threading instead?

  1. The code I pasted above is exactly what I'm running, and it reports "probability tensor contains either inf, nan or element < 0".
  2. The pasted code wraps the inference; when I call this wrapper 5 times in a short window with the same video, 4 of the calls fail and 1 succeeds. With 5 different videos everything is fine.
  3. The web service is built with Flask; I simply receive the video URL and the instruction and pass them into the inference `messages`.
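Since Flask handles each request on its own thread, a simple way to keep a single model instance safe is to let one worker thread own the model and have every HTTP handler enqueue its request and wait for the result. A stdlib-only sketch (names are illustrative, not from the thread; `handle_request` stands in for a Flask view function):

```python
import queue
import threading

# One worker thread owns the "model", so concurrent web handlers never call
# generation at the same time. Each handler enqueues a job and blocks on an
# Event until its result slot is filled.
jobs = queue.Queue()

def worker():
    while True:
        payload, slot = jobs.get()
        # placeholder for the real try/except inference wrapper shown earlier
        slot["result"] = f"answer for {payload}"
        slot["event"].set()
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

def handle_request(payload):
    """What a Flask view would do: enqueue the job, wait, return the answer."""
    slot = {"event": threading.Event()}
    jobs.put((payload, slot))
    slot["event"].wait()
    return slot["result"]

results = [handle_request(f"video-A+question-{i}") for i in range(1, 6)]
```

With this pattern the five requests for the same video are answered one after another instead of racing inside `model.generate`.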

jklj077 commented 1 week ago

Please provide a minimal working example (MWE). We cannot reproduce this with our own code using https://github.com/QwenLM/Qwen2-VL/blob/main/web_demo_mm.py.