deep-diver (closed this issue 8 months ago):
@sayakpaul
I found it difficult to keep the input tokens under `max-input-token-length`:

For both TGI and locally running models, we can count the number of input tokens and trim them down to under `max-input-token-length`, just like you did in the previous notebook.
However, this loses some important information. For instance, the `<|system|>`, `<|user|>`, and `<|assistant|>` special tokens give the model signals. By naively trimming with `[-max-input-token-length:]`, we lose the `<|system|>` part.
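For reference, the naive version looks something like this sketch (the variable names, the 4096 limit, and the checkpoint are just assumptions):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "..."},  # imagine a very long history here
]
max_input_token_length = 4096  # assumed limit

# Render the chat template to a plain string, then tokenize it.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
input_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]

# Naive trim: keep only the last max_input_token_length tokens.
# If the prompt is too long, the leading <|system|> block gets cut off.
trimmed_prompt = tokenizer.decode(input_ids[-max_input_token_length:])
```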
Even if we keep the `<|system|>` part and trim the rest, we don't want to lose the `<|user|>` and `<|assistant|>` special tokens either. That means the ideal final output should look something like this:
```
<|system|>
ALWAYS KEEP THIS PART
<|user|>
.....MIGHT NEED TO BE TRIMMED WITHIN the <|user|> section, but we should keep the <|user|> special token
<|assistant|>
.....
<|user|>
WHATEVER THE USER SAYS
```
The main obstacle is to keep the special tokens while trimming down the actual contents inside each of them (possibly removing a message entirely if its `len(content)` becomes zero after trimming). Also, the messages are given in the format `[{"role": ..., "content": ...}, ...]`; this is necessary since the OpenAI SDK only allows us to pass messages in that format.
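As a rough sketch of the trimming I have in mind, operating directly on that messages list (the even per-message budget here is just a placeholder policy, not a proposal):

```python
def trim_message_contents(messages, tokenizer, max_input_tokens):
    """Sketch: trim each message's content but keep every role, so the
    chat template can still emit all of its special tokens. The budget
    is split evenly across messages for simplicity, and the (small)
    token overhead of the special tokens themselves is ignored."""
    budget = max_input_tokens // max(len(messages), 1)
    trimmed = []
    for msg in messages:
        ids = tokenizer(msg["content"], add_special_tokens=False)["input_ids"]
        content = tokenizer.decode(ids[:budget]) if len(ids) > budget else msg["content"]
        # Drop the message entirely if nothing survived, except the system one.
        if content or msg["role"] == "system":
            trimmed.append({"role": msg["role"], "content": content})
    return trimmed
```

The result is still a `[{"role": ..., "content": ...}, ...]` list, so it can be handed to the OpenAI SDK or to `apply_chat_template` unchanged.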
Also, the special tokens differ from model to model: `HuggingFaceH4/zephyr-7b-beta` has `<|system|>`, `<|user|>`, and `<|assistant|>` special tokens, whereas `NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO` has only the `<|im_start|>` special token, and so on.
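One partial workaround that comes to mind (not a full solution): since `tokenizer.apply_chat_template` inserts whatever special tokens each model's template defines, we could trim at the message level and simply re-count in a loop, without hard-coding `<|system|>` vs `<|im_start|>` anywhere:

```python
def fit_messages(messages, tokenizer, max_input_tokens):
    """Sketch: drop the oldest non-system turns until the fully templated
    prompt fits; assumes messages[0] is the system message. Note it may
    still exceed the limit if the system message alone is too long."""
    msgs = list(messages)
    while len(msgs) > 1:
        token_ids = tokenizer.apply_chat_template(msgs, add_generation_prompt=True)
        if len(token_ids) <= max_input_tokens:
            break
        msgs.pop(1)  # remove the oldest non-system turn
    return msgs
```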
What do you think? We don't need to tackle this issue within this project, but I wanted to discuss what I have found so far. If you already know a solution for this, please let me know!
sayakpaul replied: Do you want to post it on the fellows channel and tag me?
deep-diver: This PR was tested with both TGI and `transformers` running locally (on an M3 MacBook Pro Max). Some of the changes are for introducing the async feature. The next step is to build the Gradio app.