OpenDriveLab / TCP

[NeurIPS 2022] Trajectory-guided Control Prediction for End-to-end Autonomous Driving: A Simple yet Strong Baseline.
Apache License 2.0

Benchmark on Town02 and Town05 #4

Closed vaydingul closed 1 year ago

vaydingul commented 1 year ago

Hi again :)

So far, I've reimplemented only the control branch of TCP. It is essentially the version you used in the first part of your ablation study. I followed a similar procedure to generate the dataset, except that it was collected with the Roach code. However, the dataset size is similar to the one you used in your ablations.

When I benchmark the trained model, it gives terrible results on Town02 (DS ~ 0.02) but not on Town05 (DS ~ 0.2). However, even on Town05, I am not able to replicate the result from the paper.

Have you ever encountered a similar situation where the model performs relatively well on Town05 but poorly on Town02? If you have, how did you solve it? Do you have any recommendations?

Thanks in advance! 💯

penghao-wu commented 1 year ago

Is the DS you mention on the 0-1 scale or the 0-100 scale? It is reasonable for the agent to have a better DS in Town05 than in Town02, since the single-lane roads in Town02 may lead to more blockages or collisions. But a gap of 0.2 versus 0.02 is not normal. Could you please share more details about your evaluation results, such as the infraction breakdown? Have you checked the images from the evaluation to find the failure cases? By the way, are you testing the model under Roach's testing suite or under the Leaderboard environment?

vaydingul commented 1 year ago

Is the DS you mention under the scale 0-1 or 0-100?

Sorry for not mentioning it. The scale is 0-1.

I am sharing the detailed results of the benchmarks with the following Google Sheets link: Detailed Benchmark Results

Moreover, you can find some example videos from the benchmark: LAV - Town 05 LAV - Town 02 Offline Leaderboard - Town 05 Offline Leaderboard - Town 02

By the way, are you testing the model under Roach's testing suite or under the Leaderboard environment?

Finally, I am using Roach's benchmark environment to test the models. I've also ported the LAV benchmark strategy to Roach's infrastructure. Therefore, the results above consist of two parts:

  1. LAV benchmark
  2. Offline Leaderboard benchmark

Thanks a lot!

penghao-wu commented 1 year ago

Which towns does your training data cover? Does your training data include Town01, whose layout is similar to Town02's?

vaydingul commented 1 year ago

Hi,

Sorry for the late reply. Before answering, I just wanted to complete a few more tests. I realized that the CARLA versions I used to collect the dataset and to run the benchmark were different. So, the first thing I did was fix this. Now the results are better (Town05 DS: 0.22, Town02 DS: 0.08), but they still do not match yours.
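For reference, the consistency check I added boils down to something like the sketch below. The version strings are only illustrative; in practice they would come from `carla.Client.get_client_version()` and `carla.Client.get_server_version()`.

```python
def versions_match(client_version: str, server_version: str) -> bool:
    """Return True if the CARLA client and server report the same version.

    Mismatched versions (e.g. collecting data with one release but
    benchmarking against another) can silently change simulator behavior.
    """
    return client_version.strip() == server_version.strip()


# Illustrative values only; real values come from the carla.Client API.
assert versions_match("0.9.10.1", "0.9.10.1")
assert not versions_match("0.9.10.1", "0.9.13")
```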

At this point, I think it might be better to give you a complete picture of what I've done, to make everything clear.

I am assuming that you are also very familiar with the Roach repository. So, what do you think might have gone wrong up to that point? In case you are interested, here is my fork.

Additionally, I trained a separate model on only the Town02 episodes and benchmarked it. SURPRISINGLY, the model still performs better on Town05 (DS: 0.12) than on Town02 (DS: 0.09).

penghao-wu commented 1 year ago

Have you tried using the pretrained CILRS model provided by Roach to test its performance? Maybe you could first try to reproduce the results reported in Roach. Also note that we test in the Leaderboard benchmark environment, which is different from the benchmark environment in Roach. For example, our testing environment does not have random NPC pedestrians (only some pedestrians as agents for scenarios). We also clip the throttle value of the agent to 0-0.75, following TransFuser.
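For clarity, the throttle clipping amounts to something like the following minimal sketch (the function name and defaults are illustrative; only the 0-0.75 range comes from our setup):

```python
def clip_throttle(throttle: float, lo: float = 0.0, hi: float = 0.75) -> float:
    """Clamp the model's predicted throttle to [lo, hi] before applying it
    as a vehicle control, following the TransFuser-style 0.75 throttle cap."""
    return max(lo, min(hi, throttle))


assert clip_throttle(1.0) == 0.75   # above the cap -> clipped down
assert clip_throttle(-0.3) == 0.0   # negative -> floored at zero
assert clip_throttle(0.5) == 0.5    # in range -> unchanged
```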

vaydingul commented 1 year ago

Have you tried to use the pretrained CILRS model provided by Roach to test its performance? Maybe you could first try to reproduce the results reported in Roach.

The pretrained IL agents in Roach have also been trained with five extra DAgger iterations, so they are not suitable for comparison. That's why I've attempted to train from scratch.

Also note that we are testing in the Leaderboard benchmark environment, which is different from the benchmark environment in Roach.

However, for your ablation study, you still use the route and weather definitions from the LAV paper, right? You use the Leaderboard and SRunner to run the LAV routes on the CARLA server in order to test your model.

For example, our testing envrionment does not have random npc pedestrians (only some pedestrians as agents for scenarios).

Apart from the pedestrians that act as agents in some scenarios, there are also other pedestrians, right? They are just not random, as far as I understand.

penghao-wu commented 1 year ago

The pretrained IL agents in Roach have also been trained with five extra DAgger iterations, so they are not suitable for comparison. That's why I've attempted to train from scratch.

Yes, it includes DAgger iterations. I just want to make sure there is nothing wrong with your benchmark environment.

However, for your ablation study, you still use the route and weather definitions from the LAV paper, right?

Yes.

Apart from the pedestrians that act as agents in some scenarios, there are also other pedestrians, right? They are just not random, as far as I understand.

I think the Leaderboard does not contain pedestrians except those defined by the scenarios.

vaydingul commented 1 year ago

Yes, it includes DAgger iterations. I just want to make sure there is nothing wrong with your benchmark environment.

Yes, I did this in the beginning. The results are aligned with those in the Roach paper.

I think the Leaderboard does not contain pedestrians except those defined by the scenarios.

I see.

Thank you very much for your help. I appreciate it a lot.