Open · WBSLZF opened 1 week ago
Hi WBSLZF,
Thanks for your valuable insights! There are two options in TrafficLLM for achieving traffic-domain tokenization:
The first method combines the text and traffic tokens in one tokenizer to supplement the native LLM's tokenization; this is the default implementation in the current TrafficLLM code. The combined tokenizer can handle both traffic-domain task text and raw traffic data.
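For example, with a HuggingFace-style tokenizer, the first method can be sketched roughly like this. The model name and token list below are only illustrative placeholders, not TrafficLLM's actual vocabulary:

```python
# Rough sketch of the first method: extend the native LLM tokenizer with
# traffic-domain tokens. "THUDM/chatglm2-6b" and the token list are
# placeholders for illustration only.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)

# Hypothetical traffic-domain tokens mined from your packet/flow corpora.
traffic_tokens = ["<pkt_len>", "<src_port>", "<dst_port>", "0x4500"]
tokenizer.add_tokens(traffic_tokens)

model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
# Resize the embedding matrix so the new token ids map to trainable vectors.
model.resize_token_embeddings(len(tokenizer))
```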
The second method constructs a separate tokenizer for each stage. If you want to keep this division during inference, you can modify the tokenizer-loading code in inference.py to load the two tokenizers you trained yourself, as sketched below.
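A rough sketch of that change might look like the following. The tokenizer directory names and the `stage` flag are assumptions you would adapt to wherever you saved your own trained tokenizers:

```python
# Sketch of the second method: load a stage-specific tokenizer at inference
# time. "tokenizer_stage1"/"tokenizer_stage2" and the `stage` argument are
# hypothetical names, not part of the TrafficLLM codebase.
from transformers import AutoTokenizer

def load_tokenizer(stage: str):
    # Stage one: task-understanding (natural-language) tokenizer.
    # Stage two: traffic-data tokenizer.
    path = "tokenizer_stage1" if stage == "task" else "tokenizer_stage2"
    return AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# Example: pick the tokenizer that matches the current inference stage.
tokenizer = load_tokenizer(stage="traffic")
```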
Hope the above answer helps.
Hello, I am confused about the tokenizer used in inference. Is the tokenizer trained on the datasets from both stage one and stage two? In stage one, the task is more like a natural-language task where we input text without traffic data. In stage two, the input contains traffic data. These two stages have different characteristics, so should we train a different tokenizer for each stage?