Open · WBSLZF opened 1 week ago
Hi WBSLZF,
Thanks for your valuable insights! There are two options in TrafficLLM for achieving traffic-domain tokenization:
The first method combines the text and traffic tokens in one tokenizer to supplement the native LLM's tokenization; this is the default implementation in the current TrafficLLM code. The combined tokenizer can handle both traffic-domain task text and raw traffic data.
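For example, with a HuggingFace-style tokenizer, the first method can be sketched roughly like this. The model name and token list below are only illustrative placeholders, not TrafficLLM's actual vocabulary:

```python
# Rough sketch of the first method: extend the native LLM tokenizer with
# traffic-domain tokens. "THUDM/chatglm2-6b" and the token list are
# placeholders for illustration only.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)

# Hypothetical traffic-domain tokens mined from your packet/flow corpora.
traffic_tokens = ["<pkt_len>", "<src_port>", "<dst_port>", "0x4500"]
tokenizer.add_tokens(traffic_tokens)

model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
# Resize the embedding matrix so the new token ids map to trainable vectors.
model.resize_token_embeddings(len(tokenizer))
```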
The second method constructs a separate tokenizer for each stage. If you want to keep this division during inference, you can modify the tokenizer-loading code in inference.py to load the two tokenizers you trained yourself, as sketched below.
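A rough sketch of that change might look like the following. The tokenizer directory names and the `stage` flag are assumptions you would adapt to wherever you saved your own trained tokenizers:

```python
# Sketch of the second method: load a stage-specific tokenizer at inference
# time. "tokenizer_stage1"/"tokenizer_stage2" and the `stage` argument are
# hypothetical names, not part of the TrafficLLM codebase.
from transformers import AutoTokenizer

def load_tokenizer(stage: str):
    # Stage one: task-understanding (natural-language) tokenizer.
    # Stage two: traffic-data tokenizer.
    path = "tokenizer_stage1" if stage == "task" else "tokenizer_stage2"
    return AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# Example: pick the tokenizer that matches the current inference stage.
tokenizer = load_tokenizer(stage="traffic")
```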
Hope the above answer helps.
Hello, I am confused about the tokenizer used in inference. Is the tokenizer trained on the datasets from both stage one and stage two? In stage one, the task is more like a natural-language task where we input text without traffic data. In stage two, the input contains traffic data. These two stages have different characteristics, so should we train a different tokenizer for each stage?