Thank you. I can reproduce the issue. I made a small change to the basic example to help accelerate reproduction:
import argparse
import random

import torch
import tensorrt_llm.bindings.executor as trtllm

# This example shows how to use the python bindings to create an executor,
# enqueue a request, and get the generated tokens.
# First, follow the steps in README.md to generate the engines.

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Executor Bindings Example")
    parser.add_argument("--model_path",
                        type=str,
                        required=True,
                        help="Directory containing model engine")
    args = parser.parse_args()

    # Create the executor.
    executor = trtllm.Executor(args.model_path, trtllm.ModelType.DECODER_ONLY,
                               trtllm.ExecutorConfig(1))

    random.seed(1234)

    if executor.can_enqueue_requests():
        ite_count = 0
        while True:
            # Create a batch of 16 random top-p requests.
            requests = []
            ite_count += 16
            for _ in range(16):
                input_token_ids = [random.randint(100, 10000) for _ in range(200)]
                requests.append(
                    trtllm.Request(input_token_ids=input_token_ids,
                                   max_new_tokens=105,
                                   sampling_config=trtllm.SamplingConfig(
                                       top_p=0.5, top_k=None, temperature=20.0)))

            # Skip enqueueing until the RNG state has advanced to the batch
            # that triggers the issue; this accelerates reproduction.
            if ite_count < 6616:
                continue

            # Enqueue the requests.
            request_ids = executor.enqueue_requests(requests)

            # Wait for the new tokens.
            responses = executor.await_responses(request_ids)
            for idx, re in enumerate(responses):
                output_tokens = re[0].result.output_token_ids[0]
                valid_output = all(el >= 0 and el < 200000 for el in output_tokens)
                if not valid_output:
                    print(f"Invalid output produced for request {request_ids[idx]}.")
                    print(f"Output tokens : {output_tokens[200:]}")
                    exit(-1)
                else:
                    print(f"Valid output produced for request {request_ids[idx]}.")
We are still investigating the reason.
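As a side note, the hard-coded 200000 bound in the validity check above is only a loose upper limit. A small helper like the sketch below (not part of the original example; the helper name and the config.json key names "pretrained_config" / "builder_config" are assumptions that may vary across TRT-LLM versions) could look up the engine's actual vocabulary size instead:

import json
import os

def load_vocab_size(engine_dir: str, fallback: int = 200000) -> int:
    # Hypothetical helper: read vocab_size from the engine directory's config.json.
    # The section names are assumptions about the build output layout; fall back
    # to a loose bound if they are not found.
    config_path = os.path.join(engine_dir, "config.json")
    try:
        with open(config_path) as f:
            config = json.load(f)
    except OSError:
        return fallback
    for section in ("pretrained_config", "builder_config"):
        vocab_size = config.get(section, {}).get("vocab_size")
        if isinstance(vocab_size, int):
            return vocab_size
    return fallback

With this, the check becomes all(0 <= el < vocab_size for el in output_tokens), which flags any out-of-vocabulary token rather than only the extreme outliers.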
Hi Alessio,
Thank you for finding this bug. We are looking into this issue. In case this bug becomes a bottleneck in your workflow, one workaround is to change the value of the variable mIsAirTopP to false; TRT-LLM will then adopt another top-p sampling method. We will try to fix the bug as soon as possible.
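For completeness, an API-level stopgap (distinct from the mIsAirTopP source change described above, and only a hedged suggestion rather than a maintainer-confirmed workaround) would be to keep affected workloads off the top-p path entirely, e.g. by sampling with top-k only. Note that this changes the sampling distribution; the top_k value below is arbitrary and purely illustrative:

import random
import tensorrt_llm.bindings.executor as trtllm

# Hedged stopgap: build the request without top-p so the suspect top-p kernels
# are not exercised. This is NOT the fix described above and alters sampling.
input_token_ids = [random.randint(100, 10000) for _ in range(200)]
request = trtllm.Request(
    input_token_ids=input_token_ids,
    max_new_tokens=105,
    sampling_config=trtllm.SamplingConfig(top_k=50, temperature=20.0))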
Hi @AlessioNetti, do you still have any further issues or questions? If not, we'll close this soon.
Hi - the bug was fixed a few versions back, so we can close this.
Who can help?
@byshiue
Reproduction
We noticed that TensorRT-LLM occasionally (~0.01% of requests) generates invalid tokens. The issue can be reproduced using a generic Falcon 7B model via the following:
The examples/bindings/executor/example_basic.py script was modified to issue random top-P requests (in batches of 16) until an invalid token is detected in the output. The changes are as follows:
Expected behavior
Requests should always generate valid tokens that are in the [0, vocabulary_size) range.
Actual behavior
Occasionally, requests will produce invalid tokens that fall outside of the model's vocabulary. Below is an example of the issue under our custom example_basic.py script:
As can be seen, one of the tokens is 2147483647. In other instances we have also observed negative tokens, but always in the billions range; this would suggest an integer overflow issue connected to the top-P sampling logic somewhere.
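Supporting that hypothesis, 2147483647 is exactly the maximum value of a signed 32-bit integer, and negative values in the billions are consistent with wrapped-around 32-bit arithmetic. A quick check:

# 2147483647 == 2**31 - 1, i.e. INT32_MAX; values near -2**31 would likewise
# point at overflowing or uninitialized 32-bit token indices.
INT32_MAX = 2**31 - 1
print(INT32_MAX)                 # 2147483647
print(INT32_MAX == 2147483647)   # True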
Additional notes
The issue was first observed under version 0.10.0.dev2024041600, and it is present up until 0.10.0.dev2024050700.