MeetKai / functionary

Chat language model that can use tools and interpret the results
MIT License

Error Occurs When Using Grammar Sampling with Functionary in Batch Requests #223

Open Luffyzm3D2Y opened 2 months ago

Luffyzm3D2Y commented 2 months ago

Issue Description

When the server is started with the command below and batch requests are sent (e.g., using Python multithreading to call the API), an error occurs:

python server_vllm.py --model /path/to/checkpoints/functionary-medium-v3.0 --rope-scaling '{"type": "yarn", "factor": 4.0, "original_max_position_embeddings": 8192}' --rope-theta 500000.0 --enable-grammar-sampling --tensor-parallel-size 8 --port 7782 --enable-prefix-caching

Error Log

The specific error encountered is:

INFO:     [IP Address] - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/path/to/functionary/server_vllm.py", line 65, in create_chat_completion
    return await process_chat_completion
  File "/path/to/functionary/functionary/vllm_inference.py", line 347, in process_chat_completion
    async for res in result_generator:
  File "/path/to/functionary/functionary/vllm_monkey_patch/async_llm_engine.py", line 800, in generate
    async for output in self._process_request
  File "/path/to/functionary/functionary/vllm_monkey_patch/async_llm_engine.py", line 967, in _process_request
    raise e
  File "/path/to/functionary/functionary/vllm_monkey_patch/async_llm_engine.py", line 957, in _process_request
    async for request_output in stream
  File "/path/to/functionary/functionary/vllm_monkey_patch/async_llm_engine.py", line 108, in __anext__
    raise result
  File "/path/to/functionary/functionary/vllm_monkey_patch/async_llm_engine.py", line 58, in _log_task_completion
    return_value = task.result()
  File "/path/to/functionary/functionary/vllm_monkey_patch/async_llm_engine.py", line 651, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/asyncio/tasks.py", line 489, in wait_for
    return fut.result()
  File "/path/to/functionary/functionary/vllm_monkey_patch/async_llm_engine.py", line 624, in engine_step
    request_outputs = await self.engine.step_async
  File "/path/to/functionary/functionary/vllm_monkey_patch/async_llm_engine.py", line 326, in step_async
    ) = prompt_template.grammar_sample
  File "/path/to/functionary/functionary/prompt_template/llama3_prompt_template_v3.py", line 179, in grammar_sample
    self.update_grammar_sampling_gen_state
  File "/path/to/functionary/functionary/prompt_template/llama3_prompt_template_v3.py", line 218, in update_grammar_sampling_gen_state
    gen_state["curr_text"] = tokenizer.decode(gen_state["curr_tokens"])
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3811, in decode
    return self._decode
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 625, in _decode
    text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
TypeError: argument 'ids': 'NoneType' object cannot be interpreted as an integer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function
  File "/path/to/miniconda3/envs/vllm-yzm/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/path/to/functionary/server_vllm.py", line 65, in create_chat_completion
    return await process_chat_completion
  File "/path/to/functionary/functionary/vllm_inference.py", line 347, in process_chat_completion
    async for res in result_generator:
  File "/path/to/functionary/functionary/vllm_monkey_patch/async_llm_engine.py", line 800, in generate
    async for output in self._process_request
  File "/path/to/functionary/functionary/vllm_monkey_patch/async_llm_engine.py", line 948, in _process_request
    stream = await self.add_request
  File "/path/to/functionary/functionary/vllm_monkey_patch/async_llm_engine.py", line 696, in add_request
    self.start_background_loop()
  File "/path/to/functionary/functionary/vllm_monkey_patch/async_llm_engine.py", line 560, in start_background_loop
    raise AsyncEngineDeadError
functionary.vllm_monkey_patch.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.

An example input prompt for testing (I have encountered such errors in several different cases; the common factor is that the --enable-grammar-sampling flag is used in conjunction with batch requests):

<|start_header_id|>system<|end_header_id|>

You are capable of executing available function(s) if required.
Only execute function(s) when absolutely necessary.
Ask for the required input to:recipient==all
Use JSON for function arguments.
Respond in this format:
>>>${recipient}
${content}
Available functions:
// Supported function definitions that should be called when necessary.
namespace functions {

// Perform a search on Wikidata.
type search_wikidata = (_: {
// The item to search. It should not be a sentence, but should be an entity's name.
query: string,
}) => any;

// Use this function at the end of the task handling process. As the user and other agents do not have access to your intermediate steps and the solution presented in your calling of `subtask_solver`, you should write the COMPLETE final answer in the `conclusion` parameter of this function. Include as many information from your exploration as possible.
type submit_task = (_: {
// Use around 400 words to summarize what you have done to handle this task, especially some milestones (such as writing what content to file xxx, getting what information from web xxx). Present the final answer explicitly in details. Only this conclusion will be shown to user, so you must write down enough detailed information that summarize all the things and information you got.
conclusion: string,
}) => any;

// Define subtask and generate its response by yourself if you want to solve subtask rather than generate thought or call other tools.
type subtask_solver = (_: {
// The brief description of the subtask you want to create and solve by yourself.
subtask: string,
// your detailed and self-contained response to the subtask.
solution: string,
}) => any;

} // namespace functions<|eot_id|><|start_header_id|>system<|end_header_id|>

You are WikidataAgent, and here is your profile:
WikidataAgent can search the information of entities on Wikidata.
<|eot_id|><|start_header_id|>user<|end_header_id|>

You are asked to complete the following TASK:

# Search for Information on Nedoshivina's 2010 Paper and Kuznetzov's Work on Vietnamese Specimens
## Task Inputs (including dialogues and takeaways from PREVIOUS collaboration)
[system]: The team is collaborating to solve this problem:
Where were the Vietnamese specimens described by Kuznetzov in Nedoshivina's 2010 paper eventually deposited? Just give me the city name without abbreviations.
[WebBrowserAgent]: To achieve our goal, we need to locate the specific city where the Vietnamese specimens described by Kuznetzov in Nedoshivina's 2010 paper were eventually deposited. I suggest we start by gathering information about the paper itself and any repositories or institutions mentioned in it. WikidataAgent, could you search for information on Nedoshivina's 2010 paper and Kuznetzov's work related to Vietnamese specimens? This might give us clues about the deposition location.
[WikidataAgent]: I will search for information on Nedoshivina's 2010 paper and Kuznetzov's work related to Vietnamese specimens on Wikidata. This will help us gather initial clues about the deposition location.

## Task Description
Search Wikidata for information on Nedoshivina's 2010 paper and Kuznetzov's work related to Vietnamese specimens. The goal is to gather initial clues about the deposition location of the Vietnamese specimens described by Kuznetzov in the paper.

<|eot_id|><|start_header_id|>user<|end_header_id|>

Now you must generate your thought and you must not call the tools in this stage. You should respond in the following json format:
```json
{
    "thought": "your thought"
}
```<|eot_id|><|start_header_id|>assistant<|end_header_id|>

>>>

It appears that the error is related to the grammar sampling functionality, as it is triggered when the --enable-grammar-sampling flag is used in conjunction with batch requests.

Steps to Reproduce

1. Start the Functionary server with the command provided above.
2. Send multiple requests to the /v1/chat/completions endpoint simultaneously, e.g., via Python multithreading (a minimal reproduction sketch follows).
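
A minimal reproduction sketch, roughly like what I used (the question and tool schema below are just placeholders, not the exact payload; it assumes the server started above is listening on localhost:7782):

import concurrent.futures

import requests

URL = "http://localhost:7782/v1/chat/completions"  # port from the server command above

payload = {
    "model": "/path/to/checkpoints/functionary-medium-v3.0",
    "messages": [{"role": "user", "content": "What is the weather in Hanoi?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}

def send_one(_):
    # Each thread posts the same chat/completions request; once the engine's
    # background loop dies, the remaining requests come back as 500 errors.
    return requests.post(URL, json=payload, timeout=300).status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    print(list(pool.map(send_one, range(16))))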

Observations

The error seems to be associated with the grammar_sample method in llama3_prompt_template_v3.py. The issue arises when decoding the accumulated tokens: a None value ends up in gen_state["curr_tokens"], which tokenizer.decode cannot interpret as an integer.
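
For context, the failing decode can be reproduced in isolation with any Hugging Face fast tokenizer once a None is present in the token-id list (the model name below is only an example):

from transformers import AutoTokenizer

# A None that slips into the accumulated token-id list makes decode raise the
# exact TypeError seen in the log above.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
curr_tokens = [9906, 11, 1917, None]  # last grammar-sampling step returned None
tokenizer.decode(curr_tokens)
# TypeError: argument 'ids': 'NoneType' object cannot be interpreted as an integer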

Request

Could you please investigate this issue? Any guidance or potential fixes would be greatly appreciated.

Luffyzm3D2Y commented 2 months ago

Multi-threaded requests seem to work fine when grammar sampling is not used, but the accuracy of tool calls has decreased...

khai-meetkai commented 2 months ago

Hi @Luffyzm3D2Y, can you provide the inputs (list of messages and tools) for which you found that enabling grammar sampling gave better results? From our tests, enabling and disabling grammar sampling gave almost the same results.

Luffyzm3D2Y commented 2 months ago

@khai-meetkai It seems not to be related to batch requests, because I once observed the same type of error with the thread count set to 1. The complete prompt in the bad case is:

text = '<|start_header_id|>system<|end_header_id|>                                                                                                                                         \n                                                                                                                                                                                   \nYou are capable of executing available function(s) if required.                                                                                                                    \nOnly execute function(s) when absolutely necessary.                                                                                                                                \nAsk for the required input to:recipient==all                                                                                                                                       \nUse JSON for function arguments.                                                                                                                                                   \nRespond in this format:                                                                                                                                                            \n>>>${recipient}                                                                                                                                                                    \n${content}                                                                                                                                                                         \nAvailable functions:                                                                                                                                                               \n// Supported function definitions that should be called when necessary.\nnamespace functions {\n\n// Execute your provided code and return the terminal output. To get the result, you must explicitly print the important information (intermediate results, final results, etc.) in\n your code using `print` function.          \ntype execute_code = (_: {\n// The code to execute. If None, the code from the file specified by filename will be executed. Either code or filename must be provided.\ncode?: string,\n// The file name to save the code or where the code is stored when `code` is None. If None, a file with a randomly generated name will be created. The randomly generated file will\n be deleted after execution. The file name must be a relative path. Relative paths are relative to the working directory.                                                          \nfilename?: string,\n// The working directory for the code execution. If None, a default working directory will be used.\nwork_dir?: string,\n// The language of the code. Default is "python".\nlang?: string,\n}) => any;\n\n// Use this function at the end of the task handling process. As the user and other agents do not have access to your intermediate steps and the solution presented in your calling\n of `subtask_solver`, you should write the COMPLETE final answer in the `conclusion` parameter of this function. Include as many information from your exploration as possible.\ntype submit_task = (_: {\n// Use around 400 words to summarize what you have done to handle this task, especially some milestones (such as writing what content to file xxx, getting what information from we\nb xxx). Present the final answer explicitly in details. 
Only this conclusion will be shown to user, so you must write down enough detailed information that summarize all the thing\ns and information you got.                  \nconclusion: string,\n}) => any;\n\n// Define subtask and generate its response by yourself if you want to solve subtask rather than generate thought or call other tools.\ntype subtask_solver = (_: {\n// The brief description of the subtask you want to create and solve by yourself.\nsubtask: string,\n// your detailed and self-contained response to the subtask.\nsolution: string,\n}) => any;\n\n} // namespace functions<|eot_id|><|start_header_id|>system<|end_header_id|>\n\nYou are CodeExecutor, and here is your profile:\nCodeExecutor can write and execute codes to solve given questions.\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nYou are asked to complete the following TASK:\n```\n\n# Calculate ISBN-10 Check Digit for Tropicos ID of Order Helotiales\n## Task Inputs (including dialogues and takeaways from PREVIOUS collaboration)\n[system]: The team is collaborating to solve this problem:\nCompute the check digit the Tropicos ID for the Order Helotiales would have if it were an ISBN-10 number.\n[WebBrowserAgent]: We have successfully retrieved the Tropicos ID for the Order Helotiales. Now, we need to calculate the check digit using the ISBN-10 formula. The steps are as f\nollows:                                     \n1. Treat the Tropicos ID as a 9-digit number (pad with leading zeros if necessary).\n2. Calculate the check digit using the ISBN-10 formula.\n\nLet\'s discuss the best approach to proceed with this calculation. Should we assign this task to the CodeExecutor agent, or does anyone have other suggestions?\n\n## Task Description\nUsing the retrieved Tropicos ID for the Order Helotiales, calculate the check digit as if it were an ISBN-10 number. The steps are as follows:\n1. Treat the Tropicos ID as a 9-digit number (pad with leading zeros if necessary).\n2. Calculate the check digit using the ISBN-10 formula, which involves the following steps:\n   a. Multiply each of the first nine digits by its position (i.e., the first digit by 1, the second digit by 2, and so on up to the ninth digit by 9).\n   b. Sum the results of these multiplications.\n   c. Compute the modulus 11 of the sum.\n   d. If the result is 10, the check digit is \'X\'. Otherwise, the check digit is the result itself.\n3. Combine the 9-digit Tropicos ID with the calculated check digit to form the complete ISBN-10 number.\n\n```\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nNow you must generate your thought and you must not call the tools in this stage. You should respond in the following json format:\n```json\n{\n    "thought": "your thought"\n}\n```<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n>>>\n\n'

I am sure this prompt can easily reproduce the reported error. It seems to be related to some special tokens.

Luffyzm3D2Y commented 2 months ago

> Can you provide the Inputs (list of messages and tools) that you found enabling grammar sampling gave better results?

In our experiments the focus was not on comparing functionary and GPT* models, and it is not easy to extract specific cases from the logs where grammar sampling outperformed its absence. We have observed, however, that not using grammar sampling can lead to a decrease in tool-call accuracy, occasionally requiring retries and additional text parsing. For a more detailed comparison, I suggest running experiments on complex benchmarks, such as the GAIA benchmark.

But I observed that if the final prompt given to the model is the prompt above, the model generates wrong tool calls without grammar sampling:

{
  "id": "cmpl-2de1e33bb7fe4aa68ee27099a7491140",
  "object": "chat.completion",
  "created": 1720859904,
  "model": "/path/to/functionary-medium-v3.0",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "tool_call_id": null,
        "content": null,
        "name": null,
        "function_call": null,
        "tool_calls": [
          {
            "index": null,
            "id": "call_C8nLTd5ycoOf3lxdVORloNtN",
            "function": {
              "name": "I can proceed with calculating the check digit using the ISBN-10 formula. However, I need the actual Tropicos ID for the Order Helotiales to perform the calculation. Could you please provide that information",
              "arguments": "I can proceed with calculating the check digit using the ISBN-10 formula. However, I need the actual Tropicos ID for the Order Helotiales to perform the calculation. Could you please provide that information?"
            },
            "type": "function"
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ],
  "usage": {
    "prompt_tokens": 1014,
    "total_tokens": 1058,
    "completion_tokens": 44
  }
}

So grammar sampling matters. I would appreciate it if you could solve the issue.

Luffyzm3D2Y commented 2 months ago

@khai-meetkai https://github.com/MeetKai/functionary/blob/1e5019050d8cdd569f7ac394787bc6d2b0d344a1/functionary/prompt_template/llama3_prompt_template_v3.py#L139C9-L173C30

I have found where the bug occurs in this specific case. In the function grammar_sample(), the variable grammar_sampled_token_id can still be None after grammar sampling under some conditions, which leads to None values in the sequentially updated gen_state["curr_tokens"] and finally causes the server to crash.
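
To make the failure path concrete, here is a toy, self-contained sketch of my understanding; it mirrors the shape of grammar_sample but is not the actual implementation:

# Toy sketch of the failure path; variable names mirror grammar_sample,
# but this is NOT the real code.
options = ["search_wikidata", "submit_task", "subtask_solver", "all"]
gen_state = {"curr_text": "", "curr_tokens": []}

toy_vocab = {1: "I", 2: " can", 3: " proceed"}  # stand-in vocabulary

def toy_decode(ids):
    # Stand-in for tokenizer.decode in this toy example.
    return "".join(toy_vocab.get(i, "") for i in ids)

# delta_token_ids: the top candidate token ids for this step, best logprob
# first; only 200 are available because of the logprobs setting. Here, none
# of them starts any of the allowed options.
delta_token_ids = [1, 2, 3]

grammar_sampled_token_id = None
for sampled_token_ind in delta_token_ids[:200]:
    new_curr_text = gen_state["curr_text"] + toy_decode([sampled_token_ind])
    # Accept the first candidate that keeps the text a prefix of some option.
    if any(option.startswith(new_curr_text) for option in options):
        grammar_sampled_token_id = sampled_token_ind
        break

# No candidate matched, so the id is still None; in the real code it is then
# appended to gen_state["curr_tokens"], and the next decode of that list
# raises the TypeError shown in the log.
gen_state["curr_tokens"].append(grammar_sampled_token_id)
print(gen_state["curr_tokens"])  # [None]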

Luffyzm3D2Y commented 2 months ago

@khai-meetkai Hey, I just quickly patched this bug, and it seems to have fixed the issue with grammar sampling. The idea and implementation are a bit rough, but at least the program no longer crashes because of this error. I've posted the code snippets below, hoping they might be helpful to you.

Idea: include the token ids of the grammar-sampling options in the candidate pool alongside delta_token_ids, so that grammar_sampled_token_id is guaranteed to end up non-None during grammar sampling.

Implementation, step 1: increase the logprobs length from the default of 200 to 10000 (functionary/vllm_inference.py):

if enable_grammar_sampling is False:
    logprobs = None
else:
    logprobs = 10000

Step 2: encode the options and include their token ids in the final token-id pool used for grammar sampling (functionary/prompt_template/llama3_prompt_template_v3.py):

if grammar_sampled_token_id is None:
    # Start from the top-200 candidates by logprob, then force-include the
    # token ids of the grammar options so that at least one option token is
    # always available to be sampled.
    selected_delta_token_ids = delta_token_ids[:200]
    option_token_ids = set()
    for option in options:
        ids = tokenizer.encode(option)[1:]  # drop the BOS token
        for i in ids:
            if i not in option_token_ids and i in delta_token_ids:
                option_token_ids.add(i)
    # Preserve the logprob ordering of delta_token_ids when appending.
    option_token_ids = list(option_token_ids)
    option_token_ids.sort(key=lambda x: delta_token_ids.index(x))
    for option_token_id in option_token_ids:
        if option_token_id not in selected_delta_token_ids:
            selected_delta_token_ids.append(option_token_id)
    print(f"len(selected_delta_token_ids):{len(selected_delta_token_ids)}")

    for i, sampled_token_ind in enumerate(selected_delta_token_ids):
        sampled_token = tokenizer.decode(
            [sampled_token_ind], add_special_tokens=False
        )
        # ...

One significant drawback of this implementation is that it might substantially slow down the inference speed (though I haven't investigated the reasons).

If you have a better solution, please let me know. Additionally, while working on this I found grammar sampling quite interesting; it seems to be based on a state-transition sampling strategy. Is there any systematic documentation that explains this mechanism, especially how the current state is determined, or could you point me to the relevant code? Thank you very much.

jeffrey-fong commented 2 months ago

Hi @Luffyzm3D2Y thank you so much for helping to debug this problem.

For step 1, it seemed to be a problem within vLLM. I once raised this issue, providing the latency information, but no one seems to have figured out why yet. Because vLLM's codebase has been evolving so quickly, I also haven't had the chance to find the reason for the significant latency degradation in recent versions when logprobs = len(tokenizer.get_vocab()). Thus, our team settled on a logprobs value of 200 for now to balance latency and sampling accuracy.

For step 2, forcing the sampler to also consider the options by adding them to the list of selected_delta_token_ids is a good idea. However, I wonder how we determine which option should appear first if we do not have access to their logprob values. Currently, I see that each option is added in the order in which it appears in options, which does not factor in the logprob values.

Regarding the grammar sampling, yes, it is based on a state-transition strategy. We design a finite-state machine for each prompt template. The FSM chart for functionary-small-v2.5 is here. Functionary-medium-v3.0's FSM is similar to functionary-v2.4's, which is presented below:

[FSM diagram image]

Luffyzm3D2Y commented 2 months ago

@jeffrey-fong After I posted my code snippet, I modified the code a little because I still observed the same error in some other cases. I once set logprobs = len(tokenizer.get_vocab()) as you said, and I also saw the significant increase in latency, whose cause I still have not found. It is really slow, less than 1 token/s of output.

As for your second point about how we determine which option should appear first: in theory, if we get the full logprobs (i.e., logprobs = len(tokenizer.get_vocab())) and sort the token ids from the options with option_token_ids.sort(key=lambda x: delta_token_ids.index(x)), the problem could be solved completely. But the bottleneck is the significant latency cost of retrieving all the logprobs. And thanks for your explanation of grammar sampling.
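
A tiny illustration of that ordering with made-up numbers: since delta_token_ids is already sorted by logprob (highest first), sorting the option token ids by their index in that list keeps the most probable option token first.

# Made-up values, only to show the ordering behaviour of the sort key.
delta_token_ids = [502, 77, 9013, 4, 66001]   # top candidates, best logprob first
option_token_ids = [66001, 77]                # tokens that start some option
option_token_ids.sort(key=lambda x: delta_token_ids.index(x))
print(option_token_ids)  # [77, 66001] -> 77 has the higher logprob, so it is tried first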

I'm wondering whether we can handle this error with some exception handling, in case the ideal behaviour cannot be achieved (i.e., grammar sampling will still occasionally fail), so that the whole server does not crash and we can simply retry the failed requests.
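
As a rough sketch of what I mean (a hypothetical wrapper, not the project's actual API), the grammar-sampling step could fall back to the token the engine would have produced anyway whenever the constrained step fails or finds no candidate:

from typing import Callable, Optional

def sample_with_fallback(
    constrained_sample: Callable[[], Optional[int]],
    fallback_token_id: int,
) -> int:
    # constrained_sample wraps the existing grammar-sampling call; the name and
    # signature here are hypothetical, purely for illustration.
    try:
        token_id = constrained_sample()
    except Exception:
        token_id = None
    if token_id is None:
        # Degrade to the engine's own (unconstrained) token for this step
        # rather than crashing the background loop for every in-flight request.
        token_id = fallback_token_id
    return token_id

# Example: the constrained step finds no candidate, so the fallback token is used.
print(sample_with_fallback(lambda: None, fallback_token_id=12345))  # 12345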