sequence length and return to go

As a newcomer to the field of RL (Reinforcement Learning) for TSC (Traffic Signal Control), I've recently embarked on a journey to understand the intricacies of this domain. During my exploration, I've encountered a couple of perplexing issues that I'm hoping to gain clarity on:

In my diligent review of the code, particularly for models like DT and other similar models, I've noticed an intriguing aspect. It appears that these models do not utilize sequences for decision-making processes. This observation leads me to believe that, despite their labels, these models function more like conventional models rather than exploiting the sequential decision-making characteristic inherent in typical RL frameworks.
The papers I've studied often discuss the implementation of advanced models like BC (Behavioral Cloning), CQL (Conservative Q-Learning), etc. However, I find the details regarding the implementation of these models quite vague. What puzzles me further is the authors' focus on the models proposed in the articles, while seemingly omitting the models used as offline baselines. This omission is curious and somewhat confusing.

Smart-Trafficlab / TransformerLight

sequence length and return to go #3