GooVincent opened this issue 1 month ago
We recommend using ModelRunnerCpp in the same way as run.py.

> I tried testing it several times, and sometimes it blocks after printing 'step 1'. I am confused why this happens.

It seems that there is no active request left after the request is canceled, so no response ever arrives and the call blocks.
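A hypothetical reconstruction of the blocking pattern (the reporter's actual code is not shown in this thread; the calls follow the `ModelRunnerCpp` names used here, and `request_id` is a placeholder):

```python
# Hypothetical reconstruction, not the reporter's actual code.
outputs = runner.generate(batch_input_ids, streaming=True)
runner.session.cancel_request(request_id)  # the only in-flight request is cancelled
print('step 1')
# Blocks: with no active request left, no response ever arrives,
# so await_responses() has nothing to return.
responses = runner.session.await_responses()
```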
Then how do I know whether there is an active request? It blocks the entire process; I can't stop it even with kill.
> Then how do I know whether there is an active request?
```python
stats = runner.session.get_latest_iteration_stats()
for stat in stats:
    print(stat.to_json_str())
```
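To make the "is there an active request" check concrete, here is a small sketch that parses the stats JSON; the field name `numActiveRequests` is an assumption about the executor stats schema and may differ between versions:

```python
import json

stats = runner.session.get_latest_iteration_stats()
for stat in stats:
    data = json.loads(stat.to_json_str())
    # Assumed field name; check the JSON printed above for your version.
    if data.get("numActiveRequests", 0) > 0:
        print("at least one request is still active")
```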
If you want to cancel a request, please don't call `runner.session.await_responses()` outside of `runner.generate`. You can get the `request_ids` inside `runner.generate` (link):
```python
for curr_outputs in throttle_generator(outputs, args.streaming_interval):
    # Inspect the latest runtime statistics on each streamed step.
    stats = runner.session.get_latest_iteration_stats()
    for stat in stats:
        print(stat.to_json_str())
    # Cancel the first in-flight request by its id.
    runner.session.cancel_request(request_ids[0])
```
The way to obtain the request id shown above also runs inside `runner.generate`. I'm still confused about how to fetch the request id from outside of `runner.generate`:
```python
request_ids = self.session.enqueue_requests(requests)
```
You can hack `generate` in `model_runner_cpp.py` to add `request_ids` to the output, for example as sketched below.
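A minimal sketch of such a patch, assuming `generate` calls `enqueue_requests` as quoted above; the attribute name `last_request_ids` is hypothetical, and the surrounding code may differ across TensorRT-LLM versions:

```python
# Inside ModelRunnerCpp.generate in model_runner_cpp.py (sketch only).
request_ids = self.session.enqueue_requests(requests)
# Hypothetical: stash the ids on the runner so that a caller can
# cancel an individual request later, e.g.
#   runner.session.cancel_request(runner.last_request_ids[0])
self.last_request_ids = request_ids
```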
But actually, we don't recommend that users use `cancel_request` when using `model_runner_cpp.py`.
Why? What is the purpose of `cancel_request`, then?
`model_runner_cpp.py` is used to run some simple examples. Of course, we welcome users to make changes in order to use it, including `cancel_request`.
Then how about ModelRunner, which is the py session? Is it recommended for production?
> which is the py session

The py session is https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/generation.py.

> is it recommended for production?
It depends on your needs. For production, we now recommend deploying with the Triton Inference Server; please refer to https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#deploy-with-triton-inference-server. We also provide a Python API; please refer to https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#llm-api and https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/apps.
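For a quick impression of the LLM API route, here is a minimal sketch adapted from the quick-start guide; the model name is a placeholder, so substitute your own checkpoint or engine:

```python
from tensorrt_llm import LLM, SamplingParams

# Placeholder model; any supported Hugging Face checkpoint works here.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
for output in llm.generate(["Hello, my name is"], sampling_params):
    print(output.outputs[0].text)
```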
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
System Info
I want to cancel a request in some cases, and `cancel_request` needs the request id to be passed in, so I call `await_responses` to obtain it. The following is my code.

What I am using is TensorRT-LLM version 0.12.0.

I tried testing it several times, and sometimes it blocks after printing 'step 1'. I am confused about why this happens. Could anyone help me?
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Run the code above.
Expected behavior
no
actual behavior
no
additional notes
no