MCZhi / Driving-IRL-NGSIM

[T-ITS] Driving Behavior Modeling using Naturalistic Human Driving Data with Inverse Reinforcement Learning
MIT License

Execution of Learned Policy #6

Closed abastola0 closed 2 months ago

abastola0 commented 2 months ago

I'm not sure if the learned reward function is being used to execute the actions while rendering. The rendering seems to work fine, but I'm not sure whether these actions are the result of a policy inferred from the learned reward function, since I see the exact same code during both training and testing. As far as I can tell, the actions are not inferred from the learned policy.

for lateral in lateral_offsets:
    for target_speed in target_speeds:
        # sample a trajectory
        action = (lateral, target_speed, 5)
        obs, features, terminated, info = env.step(action)

I see the human-likeness metric being calculated based on the learned reward function, but I'm just not sure whether the inference over the learned policy is carried out correctly.

Please clarify if I'm following correctly.

MCZhi commented 2 months ago

The code you listed is for the trajectory sampling process, so it is the same for both training and testing. In training, the sampled trajectories are used to approximate the partition function in max-entropy IRL; in testing, the learned reward function is used to score these sampled trajectories and select the optimal one. It's worth noting that we evaluate the human-likeness metric in an open-loop manner, which means the plan is not executed in the simulator.
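
In pseudocode, the test-time selection could look roughly like this (a minimal sketch with placeholder names such as sampled_trajectories, reward_weights, and compute_features; this is not the repo's actual API):

import numpy as np

def select_optimal_trajectory(sampled_trajectories, reward_weights, compute_features):
    # score every sampled trajectory with the learned linear reward and keep the best one
    best_traj, best_reward = None, -np.inf
    for traj in sampled_trajectories:
        features = compute_features(traj)          # feature vector summarizing the trajectory
        reward = np.dot(reward_weights, features)  # linear reward, as in max-entropy IRL
        if reward > best_reward:
            best_traj, best_reward = traj, reward
    return best_traj

The trajectory returned by such a selection step is what serves as the execution plan at test time.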

abastola0 commented 2 months ago

So it's just the score that you generate, and the visualizations you show in the paper are not based on the learned reward function, right? Since this is an open-loop simulation, you only evaluate the generated trajectories rather than execute them. In the paper, however, you mention executing them. I don't see a correspondence between the learned reward function and the executed trajectories, so I'm still confused.

MCZhi commented 2 months ago

The selected optimal trajectory, which is based on the learned reward function, can be considered an execution plan. However, it is not rolled out in the simulator (because we have already simulated the results during the sampling process); instead, it is used to measure the difference between the plan and the ground-truth human trajectory, hence open-loop evaluation. Although the rendering in the code only covers the sampling process, you can easily add rendering after the planning (selection) step to visualize the execution:

# execute and render the selected optimal plan
action = (selected_lateral, selected_target_speed, 5)
obs, features, terminated, info = env.step(action)
env.render()
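
For reference, the open-loop comparison mentioned above might be sketched as follows (planned_xy and human_xy are hypothetical arrays of planned and logged positions; the repo's actual human-likeness metric may be computed differently):

import numpy as np

def open_loop_error(planned_xy, human_xy):
    # average displacement between the planned positions and the ground-truth human positions
    planned_xy, human_xy = np.asarray(planned_xy), np.asarray(human_xy)
    horizon = min(len(planned_xy), len(human_xy))
    return float(np.mean(np.linalg.norm(planned_xy[:horizon] - human_xy[:horizon], axis=1)))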
abastola0 commented 2 months ago

Thanks. I will try this.