Reproduction of the closed-loop evaluation

Thinklab-SJTU / Bench2Drive

[NeurIPS 2024 Datasets and Benchmarks Track] Closed-Loop E2E-AD Benchmark Enhanced by World Model RL Expert

Reproduction of the closed-loop evaluation #10

Closed Fengtao22 closed 5 months ago

Fengtao22 commented 5 months ago

Hi, first of all, thanks for your quick responses to my previous questions. I am now able to run your Bench2Drive closed-loop evaluation pipeline. Since my CARLA environment keeps crashing, I have only been able to collect results for 94 routes so far.

Based on my current results, computed over the first 88 routes (collected using your config vad_base_e2e_b2d.py and the provided vad_b2d_base.pth), the driving score is 36.86, which is close to the 39.42 reported in your paper. However, the success rate is way higher than 10%: in my case there are 28 successful runs in total (i.e., status is Completed). Over the full 220 routes, the number of successful runs in your case should be only 22. Can you provide some insight into my result?

Besides that, how many repetitions do you use for your reported result? Thanks!

jiaxiaosong1002 commented 5 months ago

@Fengtao22 Note that the crashing cases should be counted as failures, since the algorithm keeps driving to strange places, which leads to the crash. Besides, you should always run all routes before drawing any conclusion, as different routes differ greatly in difficulty.

Single run, as the variance of short routes is limited.

Fengtao22 commented 5 months ago

Thanks for your quick response! Yes, I also noticed that the variance is limited, so a single run is good enough. As for your comment to "run all routes before obtaining any conclusion", I cannot agree with that. Specifically, my newest experiment has collected 110 routes for the b2d checkpoint, and the number of successfully completed routes is now 33. Even if none of the remaining 110 routes completes successfully, my success rate would be 33/220, which is still better than 10%, right? Unless some weighting is applied when computing the final success rate.
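The bound argued here is plain arithmetic and can be checked with a minimal sketch. The function name `success_rate_bounds` is hypothetical, and the sketch assumes the count passed in contains only true successes, with every uncollected or crashed route treated as a failure for the lower bound.

```python
def success_rate_bounds(successes_so_far, routes_done, total_routes):
    """Best- and worst-case final success rate given a partial evaluation."""
    remaining = total_routes - routes_done
    # Worst case: every remaining route fails (or crashes).
    lower = successes_so_far / total_routes
    # Best case: every remaining route succeeds.
    upper = (successes_so_far + remaining) / total_routes
    return lower, upper


# Numbers quoted in the comment: 33 successes after 110 of 220 routes.
lower, upper = success_rate_bounds(33, 110, 220)
print(f"final success rate lies in [{lower:.2%}, {upper:.2%}]")
```

The lower bound only holds if the 33 routes really are successes under the benchmark's definition, not merely routes whose status is Completed.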

jiaxiaosong1002 commented 5 months ago

@Fengtao22 Please read our paper and code carefully. "Completed" does not mean success. A route counts as a success only if it is completed without infractions (speed infractions excepted).
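The rule stated here can be illustrated with a minimal sketch. This is not the repository's code: the record layout (a `status` string plus an `infractions` dict mapping infraction names to lists of events), the key name `min_speed_infractions`, and the status strings are assumptions made for illustration, so check them against your own result JSON.

```python
def is_success(record, ignored=("min_speed_infractions",), done_statuses=("Completed",)):
    """True if the route finished and has no infractions other than the ignored ones."""
    if record.get("status") not in done_statuses:
        return False
    infractions = record.get("infractions", {})
    return all(
        len(events) == 0
        for name, events in infractions.items()
        if name not in ignored
    )


def success_rate(records, total_routes):
    """Successes divided by ALL routes, so crashed or missing routes count as failures."""
    return sum(is_success(r) for r in records) / total_routes


# Hypothetical records: only the first one is a success.
records = [
    {"status": "Completed", "infractions": {"collisions_vehicle": [], "min_speed_infractions": ["slow"]}},
    {"status": "Completed", "infractions": {"collisions_vehicle": ["hit"], "min_speed_infractions": []}},
    {"status": "Failed - Agent crashed", "infractions": {}},
]
print(success_rate(records, total_routes=4))
```

Counting every route with status Completed as a success overestimates the success rate, which is consistent with the 28/88 and 33/110 figures reported earlier in the thread.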

MianHtan commented 4 months ago

Is there any way to avoid this kind of crash? I am stuck on one route now: CARLA keeps crashing, which makes it impossible to complete the evaluation of all routes.

jayyoung0802 commented 4 months ago

resume=True works well. It uses the JSON file to determine the route ID from which to continue the evaluation, so please keep the file path unchanged. If one route always crashes (possibly caused by agent behavior), you can skip it and manually decrement 'progress' by 1.
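Before editing anything by hand, it can help to inspect what the JSON file currently records. The sketch below assumes a layout of `{"_checkpoint": {"progress": [done, total], "records": [...]}}` with `route_id` and `status` fields per record. That layout, the function names, and the example values are assumptions, so compare them with your own file first.

```python
import json


def summarize_checkpoint(data):
    """Return (routes_done, routes_total, unfinished) from a loaded results dict.

    Assumed layout (not confirmed in this thread):
    {"_checkpoint": {"progress": [done, total], "records": [{"route_id": ..., "status": ...}]}}
    """
    checkpoint = data["_checkpoint"]
    done, total = checkpoint["progress"]
    # Collect every recorded route whose status is anything other than Completed.
    unfinished = [
        (rec.get("route_id"), rec.get("status"))
        for rec in checkpoint.get("records", [])
        if rec.get("status") != "Completed"
    ]
    return done, total, unfinished


def summarize_checkpoint_file(path):
    """Same summary, read from a JSON file on disk."""
    with open(path) as f:
        return summarize_checkpoint(json.load(f))


# Hypothetical file contents, used only to show the call.
example = {
    "_checkpoint": {
        "progress": [2, 55],
        "records": [
            {"route_id": "RouteScenario_0", "status": "Completed"},
            {"route_id": "RouteScenario_1", "status": "Failed - Agent crashed"},
        ],
    }
}
done, total, unfinished = summarize_checkpoint(example)
print(f"{done}/{total} routes recorded; not completed: {unfinished}")
```

The sketch only reads the file. It does not modify 'progress', and keeping a backup copy of the JSON before any manual edit is a sensible precaution.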

MianHtan commented 4 months ago

Does "skip it" mean deleting the route from the XML file?

jayyoung0802 commented 4 months ago

The crash is caused by the agent's behavior, so this route is marked as failed. The number of crashes is reported in the details of the final statistics. For convenience during evaluation, you can temporarily comment the route out; this will not affect the final result.

starlighttt123 commented 2 months ago

Hello, what does "'progress'-1 manually" mean? Does it mean manually changing "progress": [2, 55] to [2, 54], for example? And what is the whole resume process? Is it: 1. find the crashing route ID and comment it out in the XML; 2. manually change the JSON file; 3. re-run run_evaluation_multi_vad.sh? Am I missing any other step? Thanks for your reply!