RouteScenario is not related to Scenario in any way, so you can ignore that.

Thanks for your reply.
leaderboard/data/additional_routes/sample_weather_route.xml
The data on my machine is 68G. In results/sample_result.json, I can see RouteScenarios from 0 to 54, but there are only 16 to 25 in my outputs.

I had updated the transfuser checkpoint with the newer version (which was submitted to the leaderboard earlier, sometime in July). This newer version is trained on a different dataset that contains multiple weathers, as opposed to the model in the paper, which is trained only on clear weather data. This could be a possible reason for the lower results, since routes_town05_long.xml contains only clear weather.
The geometric fusion checkpoint is the same one as in the paper, and it gives a driving score of around 20. This is within the variance reported in the paper (25 ± 4). You can also try running the evaluation multiple times to verify whether there is a huge variance or not.
To reproduce the results in the paper, you should train the models on clear_weather_data.
The sample_result.json file is from a different evaluation route. I included it just to give an example of the formatting of the results file. You can ignore the scores and other things in this file.
Why the "Avg. route completion" score for geometric fusion checkpoint is only around 20? In the paper it is around 70.
I'm not sure about that. I'd suggest running the evaluation multiple times to get an estimate of the variance. Since the evaluation on long routes takes quite some time, you could try reproducing the scores on routes_town05_short.xml, which should be much faster.
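If it helps, here is a rough sketch of how the repeated runs could be scripted. It assumes the repo's leaderboard/scripts/run_evaluation.sh entry point and that the script picks up ROUTES, SCENARIOS and CHECKPOINT_ENDPOINT from the environment; if your copy hard-codes these inside the script, edit them there instead. The paths and variable names below are assumptions, not verified against the repo.

# Rough sketch (see assumptions above): repeat the evaluation a few times and write
# each run's results to its own JSON file so the spread can be compared afterwards.
# A CARLA server must already be running.
import os
import subprocess

N_RUNS = 3
os.makedirs("results", exist_ok=True)
env = os.environ.copy()
env["ROUTES"] = "leaderboard/data/evaluation_routes/routes_town05_short.xml"  # assumed route file location
env["SCENARIOS"] = "leaderboard/data/scenarios/town05_all_scenarios.json"

for i in range(N_RUNS):
    env["CHECKPOINT_ENDPOINT"] = f"results/town05_short_run{i}.json"  # separate results file per run
    subprocess.run(["bash", "leaderboard/scripts/run_evaluation.sh"], env=env, check=True)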
OK, I get it. I will give it a try. Thank you!
Which version of CARLA are you using? Also, are you using the leaderboard code from this repo or the official CARLA leaderboard repo?
We use ./setup_carla.sh, so I guess it is 0.9.10.1?
FYI, after running ./CarlaUE4.sh --world-port=2000 -opengl, I saw some output:
4.24.3-0+++UE4+Release-4.24 518 0
Disabling core dumps.
That is fine.
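If you want to confirm the exact versions your client and server are running, a quick check with the standard CARLA Python API (assuming the server started above on port 2000 and the PythonAPI egg on your PYTHONPATH):

# Quick check of the CARLA client/server versions.
import carla

client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
print("client version:", client.get_client_version())  # version of the installed PythonAPI
print("server version:", client.get_server_version())  # version of the running CarlaUE4 binary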
Hi! I uploaded the pre-trained model to the CARLA leaderboard, but I got poor results.
Driving score: 10.336
Route completion: 15.480
Infraction penalty: 0.847
Collisions pedestrians: 0.000
Collisions vehicles: 0.840
Collisions layout: 0.450
Red light infractions: 0.542
Stop sign infractions: 0.000
Off-road infractions: 0.527
Route deviations: 0.000
Route timeouts: 0.011
Agent blocked: 31011.783
FYI, the leaderboard session lasted for 72 hours and 51 minutes. I built the Docker image after switching to the leaderboard_submission branch and put the pre-trained model in the model_ckpt/transfuser directory.
I wonder whether, in the leaderboard session, the agent was tested under 14 weathers or only one. I think the leaderboard uses the same weather conditions and route scenarios for all submissions. Is that right?
When evaluating on my machine, I also noticed that the agent would stop before the crossroad, whether the traffic light was red or green. I think this is because the light is too small in the image, only a few pixels, so it is quite hard to perceive. Usually, the agent would follow another vehicle travelling in the same direction to drive through the crossroad. If there were no other vehicles, it just stopped. Have you noticed similar phenomena?
I found that the new model may not move at all on some routes, like here; maybe that is one of the reasons.
@Co1lin Sorry for the late reply. Earlier, when I updated the transfuser model definition and agent file corresponding to the new checkpoint in the main branch, I forgot to update them in the leaderboard_submission branch. I have updated them now; you should be able to get better results.
The official CARLA leaderboard evaluation considers a secret set of weathers and towns, which are unknown to the public for benchmarking purposes.
Hi @ap229997, I also got a poor evaluation result after retraining, validating and testing all of your models. I would like to know how you train, validate and test them: did you train on towns 1, 2, 3, 4, 6, 7 and 10 using the long, short and tiny routes, then validate and test on the Town05 long, short and tiny routes? Or did you exclude Town05's long routes from the validation set and keep them for testing (i.e., validate on Town05 short and tiny, then test on Town05 long)?
And a quick question about these model weights:
mkdir model_ckpt
wget https://s3.eu-central-1.amazonaws.com/avg-projects/transfuser/models.zip -P model_ckpt
unzip model_ckpt/models.zip -d model_ckpt/
rm model_ckpt/models.zip
Are all of them trained on the 14 kinds of weather data that can be downloaded with download_data.sh, or is it only the transfuser model? Thanks in advance.
Hi @oskarnatan, sorry for the late reply.
All the checkpoints except transfuser are trained on clear_weather_data. I agree that this is confusing since there are multiple datasets, but I'd suggest using the 14_weathers_data, since weathers are important in the official CARLA leaderboard. I apologize for the confusion. In our case, it was important to decouple the effect of weathers and scenarios, so we created multiple datasets.
For training, we use data collected from the Town01, 02, 03, 04, 06, 07 and 10 tiny and short routes (not the long routes); we validate offline using data from the Town05 short routes and then test online on the Town05 short and long routes.
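In case a concrete picture helps, the split above could be written down roughly like this; the labels are mine, purely illustrative, and not identifiers taken from the repo's config files:

# Illustrative restatement of the train/val/test split described above.
TRAIN = {town: ["tiny", "short"] for town in
         ["Town01", "Town02", "Town03", "Town04", "Town06", "Town07", "Town10"]}  # long routes excluded
VAL_OFFLINE = {"Town05": ["short"]}           # offline validation on held-out town data
TEST_ONLINE = {"Town05": ["short", "long"]}   # online evaluation in the CARLA simulator

print(sum(len(r) for r in TRAIN.values()), "train town/route combinations")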
@ap229997, I see. Well noted, thank you.
Hey @ap229997 Aditya, thanks for your wonderful work! A follow-up question to your last note: is there any reason for excluding the long routes during training? For example, the Town01, 02, 03, 04, 06, 07 and 10 long routes, which are not used in validation or testing either. Does this imply that including these long routes during training decreases performance?
Another follow-up question, which is also related to this issue https://github.com/autonomousvision/transfuser/issues/12:
So, I have retrained a model with the 14_weathers_data and run evaluations on routes_town05_long 3 times. However, the DS/RC variation is large:
which seems quite different from what you have reported (actually very close to the "Geometric Fusion Results"): 33.15/56.36. In your work, you mention that you trained 3 different models and ran 3 evaluations of each, which yields 9 results in total, so I can understand that my results differ from the ones in the paper.
However, I'm still a bit curious why the variation is so large even with repeated tests. Do you think the results I got are normal?
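For what it's worth, one way to quantify that spread is to aggregate the per-run result JSONs. This is only a sketch: it assumes each file follows the usual leaderboard layout with a _checkpoint.global_record.scores block containing score_composed (DS) and score_route (RC); please check those key names against your own files, and the file names below are hypothetical.

# Sketch: mean and spread of driving score (DS) and route completion (RC) over repeated runs.
import json
import statistics

files = ["run0.json", "run1.json", "run2.json"]  # hypothetical per-run results files

ds, rc = [], []
for path in files:
    with open(path) as f:
        scores = json.load(f)["_checkpoint"]["global_record"]["scores"]  # assumed layout
    ds.append(scores["score_composed"])
    rc.append(scores["score_route"])

print(f"DS: {statistics.mean(ds):.2f} +/- {statistics.stdev(ds):.2f}")
print(f"RC: {statistics.mean(rc):.2f} +/- {statistics.stdev(rc):.2f}")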
Hi @xinshuoweng, sorry for the late response; I missed this comment somehow. We observed that including long routes skewed the training data distribution heavily (Fig. 1(a) of our supplementary), which led to a drop in performance.
For evaluation, we only consider Town05 long routes in this work, but other routes can also be used; we indeed used multi-town routes in another work of ours, NEAT.
@KleinYuan For the results in the paper, we used clear_weather_data, whereas for submitting to the leaderboard, we used 14_weathers_data, so you'd have to train on clear_weather_data to reproduce the results in the paper.
I think these results are fine. Given that the routes_town05_long evaluation only consists of clear weather, I'd expect a model trained on only clear weather to perform better than a model trained on multiple weathers.
Also, if you want to consider evaluation in a multi-weather setting, we provide one in another work of ours, NEAT, or you can directly submit to the leaderboard.
Hi! Thanks for your amazing work. I am trying to achieve the same performance as shown in the paper, but I have run into some problems.
In leaderboard/data/scenarios/town05_all_scenarios.json, I see that the last one has "scenario_type": "Scenario10". But in the evaluation log, I get these:
I am wondering what the relationship between Scenario and RouteScenario is?
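For reference, the scenario types in that file are the trigger definitions (Scenario1 to Scenario10), while the RouteScenario indices in the log number the evaluation routes; as noted above, the two are unrelated. A sketch of how the JSON can be inspected, assuming the usual CARLA scenario-file layout (the key names here should be checked against the actual file):

# Sketch: count trigger points per scenario type for Town05.
# Assumed layout: {"available_scenarios": [{"Town05": [{"scenario_type": ..., "available_event_configurations": [...]}, ...]}]}
import json
from collections import Counter

with open("leaderboard/data/scenarios/town05_all_scenarios.json") as f:
    data = json.load(f)

counts = Counter()
for town_block in data["available_scenarios"]:
    for town, scenarios in town_block.items():
        for s in scenarios:
            counts[s["scenario_type"]] += len(s.get("available_event_configurations", []))

print(counts)  # e.g. Scenario1 .. Scenario10 with their trigger-point counts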
Result of model_ckpt/geometric_fusion (pretrained):
Result of model_ckpt/transfuser (pretrained):
Result of transfuser (self-trained):
I think something is wrong, because these are significantly poorer than the results in your paper. Can you spot any possible mistakes?
Evaluation script used for pre-trained transfuser model: