LucasAlegre / sumo-rl

Reinforcement Learning environments for Traffic Signal Control with SUMO. Compatible with Gymnasium, PettingZoo, and popular RL libraries.
https://lucasalegre.github.io/sumo-rl
MIT License

Reward is oddly bad for ql_4x4 code #97

Closed: Sitting-Down closed this issue 2 years ago

Sitting-Down commented 2 years ago

Hello, I tried to run my own net and route files using the 4x4 code without changing anything, and it outputs really bad waiting-time values (screenshot attached).

This made me want to run it with fixed_ts for comparison, to see whether the issue was with the net/route files or with the code, and the fixed_ts run came out a lot better than the QL one (screenshot attached).

Did I make a mistake, or is it really supposed to be like this?

Sitting-Down commented 2 years ago

I would also like to ask what the numbers represent when the simulation has finished a run. It outputs the Step #, but I don't really understand what RT, UPS, TOT, ACT, and BUF mean.

Thank you

Sitting-Down commented 2 years ago

I also tried to run the A3C code using the same net and route files. I am still in the process of debugging and figuring out how to make it work, but looking at the code I can't seem to understand what controls how many times the program trains.

Thank you for any help you can give.

LucasAlegre commented 2 years ago

Hi, the problem was the gamma value (it was too high, 0.995). In my last commit I changed it back to the old value, and the agents now learn well on the 4x4 grid. Note that tabular Q-learning with the default state space is only feasible for intersections with few lanes and actions; otherwise there are too many states for it to learn efficiently. A sketch of that setup is below.
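For reference, a minimal sketch of the multi-agent tabular Q-learning loop in the spirit of experiments/ql_4x4grid.py. The file paths and hyperparameter values (alpha, gamma, epsilon schedule) are illustrative, and the reset/step signatures may differ slightly across sumo-rl and gymnasium versions, so check the current script in the repository for the values actually used:

```python
from sumo_rl import SumoEnvironment
from sumo_rl.agents import QLAgent
from sumo_rl.exploration import EpsilonGreedy

env = SumoEnvironment(
    net_file="nets/4x4-Lucas/4x4.net.xml",            # illustrative paths
    route_file="nets/4x4-Lucas/4x4c1c2c1c2.rou.xml",
    num_seconds=80000,
    delta_time=5,
)

initial_states = env.reset()

# One independent tabular Q-learning agent per traffic signal.
agents = {
    ts: QLAgent(
        starting_state=env.encode(initial_states[ts], ts),
        state_space=env.observation_space,
        action_space=env.action_space,
        alpha=0.1,
        gamma=0.95,  # lower than the 0.995 that caused the bad results; see the repo for the exact value
        exploration_strategy=EpsilonGreedy(initial_epsilon=0.05, min_epsilon=0.005, decay=1.0),
    )
    for ts in env.ts_ids
}

done = {"__all__": False}
while not done["__all__"]:
    actions = {ts: agents[ts].act() for ts in agents}
    states, rewards, done, _ = env.step(action=actions)
    for ts in states:
        agents[ts].learn(next_state=env.encode(states[ts], ts), reward=rewards[ts])
env.close()
```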

For the SUMO command-line (step-log) outputs, see https://sumo.dlr.de/docs/Simulation/Output/index.html#commandline_output_step-log
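The step-log line looks roughly like this (the exact formatting depends on the SUMO version; the numbers here are made up):

```
Step #600.00 (1ms ~= 1000.00*RT, ~56000.00UPS, TraCI: 0ms, vehicles TOT 56 ACT 56 BUF 0)
```

Roughly: RT is the real-time factor (simulated time per wall-clock time), UPS is vehicle updates per second, TOT is the number of vehicles inserted since the start of the simulation, ACT is the number of vehicles currently in the network, and BUF is the number of vehicles waiting to be inserted.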

There is generally a parameter that sets the total number of steps the algorithm will run. Note that one step for the algorithm corresponds to 5 s in the simulation (see the sketch below).
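As a concrete illustration of that relation (the parameter names num_seconds and delta_time come from SumoEnvironment; the file names and values here are placeholders):

```python
from sumo_rl import SumoEnvironment

# delta_time (5 s by default) is how many simulation seconds pass between two
# consecutive agent decisions, so one algorithm step == delta_time seconds.
# num_seconds is the simulated length of an episode, so an episode lasts
# roughly num_seconds / delta_time agent steps (here 20000 / 5 = 4000 steps).
env = SumoEnvironment(
    net_file="my.net.xml",    # placeholder file names
    route_file="my.rou.xml",
    num_seconds=20000,
    delta_time=5,
)
```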

Best, Lucas

Sitting-Down commented 2 years ago

Thank you for the clarification!

Sorry to bother you, but would it be possible for you to look at my code and point out any issues you find?

My goal was to create a final RL model that gives better waiting-time values than the simple phase configuration set by Eclipse SUMO (phase 1 -> yellow phase -> phase 2 -> yellow phase, which I assume is what fixed_ts does).

I tried to base it on your A3C code, but using Ray's DQN: https://pastebin.com/JerccZ6V

Specifically, the environment consists of a .net file and .rou files: the .net file is a 3-intersection network with one road connecting the intersections, and the .rou files are 28 traffic instances I gathered (AM and PM files for 14 days).

My idea was to train a DQN model whose final version gives a better waiting-time graph than the fixed_ts one.

The flow was supposed to be: create the environment using main_02.net.xml and Day1AM.rou.xml, train the model on those for 60 or so iterations, stop it, then restore from the last checkpoint and continue training with Day1PM.rou.xml, still using main_02.net.xml.

The culminating goal is a model that can handle real-life traffic better than the default traffic-light configuration. A rough sketch of the intended flow is below.
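To make the flow concrete, here is a rough sketch of what I am trying to do. It assumes Ray's older rllib.agents.dqn.DQNTrainer API and sumo-rl's single_agent mode; the file names are from my setup, num_seconds is just a placeholder, and it is not the exact code from the pastebin:

```python
import ray
from ray.tune.registry import register_env
from ray.rllib.agents.dqn import DQNTrainer  # older RLlib API; recent Ray versions use ray.rllib.algorithms.dqn

from sumo_rl import SumoEnvironment

def make_env(route_file):
    # single_agent=True exposes a plain Gym interface; with three signalized
    # intersections a multi-agent setup (e.g. the PettingZoo interface) would
    # normally be needed so each intersection gets its own observations/policy.
    return SumoEnvironment(
        net_file="main_02.net.xml",
        route_file=route_file,
        num_seconds=20000,       # placeholder episode length in simulation seconds
        single_agent=True,
    )

if __name__ == "__main__":
    ray.init()

    # Phase 1: train on the AM demand.
    register_env("sumo_am", lambda _: make_env("Day1AM.rou.xml"))
    trainer = DQNTrainer(env="sumo_am")
    for _ in range(60):          # "60 or so iterations"
        trainer.train()
    checkpoint = trainer.save()  # path of the last checkpoint

    # Phase 2: build a trainer on the PM demand and restore the learned weights.
    register_env("sumo_pm", lambda _: make_env("Day1PM.rou.xml"))
    trainer_pm = DQNTrainer(env="sumo_pm")
    trainer_pm.restore(checkpoint)
    for _ in range(60):
        trainer_pm.train()

    ray.shutdown()
```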

I would appreciate any advice you can give.

Thank you

Sitting-Down commented 2 years ago

I'd also like to clarify: are the actions taken by the agent only changing the length of the green light, not deciding which road gets the green light, right?

So the code still goes through each phase as dictated by the .net file, and it's just that the green time listed there is controlled by the code?