NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Why is ModelRunnerCpp await_responses blocked? #2246

Open GooVincent opened 1 month ago

GooVincent commented 1 month ago

System Info

I want to cancel a request in some cases, and cancel_request needs the request id, so I call await_responses to obtain it. The following is my code.

I am using TensorRT-LLM version 0.12.0.

    # Streaming generation; "..." stands for the remaining arguments.
    outputs = runner.generate(..., streaming=True, ...)
    for curr_outputs in outputs:
        print('step 1')
        responses = runner.session.await_responses()
        print('step 2')
        for response in responses:
            runner.session.cancel_request(response.request_id)
            print('step 3')
            runner.session.await_responses()
            print('step 4')

I tested it several times, and sometimes it blocks after printing 'step 1'. I am confused about why this happens. Could anyone help me?

Who can help?

No response

Information

Tasks

Reproduction

Run the code above.

Expected behavior

no

actual behavior

no

additional notes

no

lfr-0531 commented 1 month ago

We recommend using ModelRunnerCpp in the same way as run.py.

I tested it several times, and sometimes it blocks after printing 'step 1'. I am confused about why this happens.

It seems that there is no active request left after the cancel, so there is no response for await_responses to return.
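
If you need to avoid blocking forever, one option is to poll with a timeout. A minimal sketch, assuming the executor bindings in your version accept a datetime.timedelta timeout (mirroring the C++ Executor::awaitResponses overloads); please verify against your installation:

    import datetime

    # Sketch: wait with a timeout instead of blocking indefinitely.
    # The timeout argument is an assumption based on the C++ executor
    # API; check the bindings of your TensorRT-LLM version.
    responses = runner.session.await_responses(
        timeout=datetime.timedelta(milliseconds=100))
    if not responses:
        # Nothing arrived within the window, e.g. because every
        # request has already finished or been cancelled.
        print('no pending responses')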

GooVincent commented 1 month ago

We recommend using ModelRunnerCpp in the same way as run.py.

I tested it several times, and sometimes it blocks after printing 'step 1'. I am confused about why this happens.

It seems that there is no active request left after the cancel, so there is no response for await_responses to return.

Then how do I know whether there is an active request? The call blocks the entire process, and I can't stop it even with kill.

lfr-0531 commented 1 month ago

Then how do I know whether there is an active request?

    stats = runner.session.get_latest_iteration_stats()
    for stat in stats:
        print(stat.to_json_str())
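
Each stat is an IterationStats object; its JSON should include an active-request count (num_active_requests in the executor bindings, if I remember correctly), which tells you whether anything is still in flight.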

If you want to cancel a request, please don't call runner.session.await_responses() outside of runner.generate. You can get the request_ids inside runner.generate (link).

    for curr_outputs in throttle_generator(outputs,
                                           args.streaming_interval):
        stats = runner.session.get_latest_iteration_stats()
        for stat in stats:
            print(stat.to_json_str())
        # request_ids has to be exposed from inside runner.generate
        runner.session.cancel_request(request_ids[0])

GooVincent commented 1 month ago

Then how do I know whether there is an active request?

    stats = runner.session.get_latest_iteration_stats()
    for stat in stats:
        print(stat.to_json_str())

If you want to cancel a request, please don't call runner.session.await_responses() outside of runner.generate. You can get the request_ids inside runner.generate (link).

    for curr_outputs in throttle_generator(outputs,
                                           args.streaming_interval):
        stats = runner.session.get_latest_iteration_stats()
        for stat in stats:
            print(stat.to_json_str())
        runner.session.cancel_request(request_ids[0])

The code that obtains the request id, shown below, also runs inside generate. I'm still confused about how to fetch the request id from outside runner.generate.

    request_ids = self.session.enqueue_requests(requests)

lfr-0531 commented 1 month ago

You can hack generate in model_runner_cpp.py to add the request_ids to the output.
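
For illustration, a minimal patch sketch; the last_request_ids attribute below is invented for this example, not an existing API:

    # Hypothetical tweak inside ModelRunnerCpp.generate in
    # model_runner_cpp.py: stash the ids returned by enqueue_requests
    # on the runner so callers can cancel from outside generate.
    request_ids = self.session.enqueue_requests(requests)
    self.last_request_ids = request_ids  # invented attribute

    # Caller side:
    # for curr_outputs in throttle_generator(outputs, args.streaming_interval):
    #     runner.session.cancel_request(runner.last_request_ids[0])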

But we don't actually recommend that users use cancel_request with model_runner_cpp.py.

GooVincent commented 1 month ago

Why? What is cancel_request for, then?

lfr-0531 commented 1 month ago

model_runner_cpp.py is meant for running simple examples. Of course, we welcome users making changes to it for their own use, including cancel_request.

GooVincent commented 1 month ago

Then what about ModelRunner, which is the Python session? Is it recommended for production?

lfr-0531 commented 1 month ago

which is the Python session

py session is https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/generation.py.

Is it recommended for production?

It depends on your needs. For production, we currently recommend deploying with Triton Inference Server; please refer to https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#deploy-with-triton-inference-server.

We also provide a Python LLM API; please refer to https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#llm-api and https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/apps.
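
For example, a minimal LLM API sketch along the lines of the quick start guide; the model name and prompt are placeholders:

    from tensorrt_llm import LLM, SamplingParams

    # Minimal sketch following the LLM API quick start; the model and
    # prompt below are placeholders.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    params = SamplingParams(temperature=0.8, top_p=0.95)

    for output in llm.generate(["Hello, my name is"], params):
        print(output.outputs[0].text)

With this higher-level API, request handling is managed for you, so there is no need to await or cancel responses by hand.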

github-actions[bot] commented 5 days ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.