A framework where a deep Q-Learning Reinforcement Learning agent tries to choose the correct traffic light phase at an intersection to maximize traffic efficiency.
I think the reward is defined as the difference of cumulative wait time between the action intervals. So positive or negative rewards will be recevied.
Reward is sum of cumulative wait time, right? How it is going negative in the graph(after running testing_main.py)?