Problem: we were implicitly adding another BOS token inside the vLLM engine (in addition to the one already in the prompt). This produced double BOS tokens in our prompts during evaluation and caused a slight quality degradation.
Solution: Modify our api_server.py so that we never create double BOS tokens: encode the prompt without adding special tokens (BOS/EOS), then prepend the BOS only if it is missing. We preferred making the change in api_server.py to avoid having to specify the tokenizer for API-server runs and to avoid customizing the prompt based on the model/tokenizer revision.
Alternative considered: Modify the API runner to pass prompt_token_ids directly. However, that would lose compatibility with the OpenAI server format and with the vanilla vLLM api_server, since both expect a prompt of type string rather than token IDs.
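The encoding change described above can be sketched as follows. This is a minimal illustration, not the actual api_server.py diff; the helper name `encode_prompt` is hypothetical, and a HuggingFace-style tokenizer interface (`encode(..., add_special_tokens=...)`, `bos_token_id`) is assumed:

```python
def encode_prompt(tokenizer, prompt: str) -> list[int]:
    """Encode a prompt without duplicating the BOS token.

    Hypothetical helper illustrating the fix; assumes a
    HuggingFace-style tokenizer interface.
    """
    # Encode without special tokens so the tokenizer does not add its
    # own BOS on top of one already present in the prompt string.
    token_ids = tokenizer.encode(prompt, add_special_tokens=False)

    # Prepend BOS only if the model has one and it is missing.
    bos = tokenizer.bos_token_id
    if bos is not None and (not token_ids or token_ids[0] != bos):
        token_ids = [bos] + token_ids
    return token_ids
```

This keeps the server's external interface unchanged (the client still sends a plain string prompt), while guaranteeing exactly one BOS token regardless of whether the caller's prompt template already included it.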