autonomousvision / transfuser

[PAMI'23] TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving; [CVPR'21] Multi-Modal Fusion Transformer for End-to-End Autonomous Driving

Poor evaluation result #37

Closed: Co1lin closed this issue 2 years ago

Co1lin commented 2 years ago

Hi! Thanks for your amazing work. I am trying to achieve the same performance as reported in the paper, but I am running into some problems.

  1. In leaderboard/data/scenarios/town05_all_scenarios.json, I see that the last entry has "scenario_type": "Scenario10". But in the evaluation log, I get this:
"meta": {
                "exceptions": [
                    [
                        "RouteScenario_16",
                        0,
                        "Failed - Agent got blocked"
                    ],
                    [
                        "RouteScenario_17",
                        1,
                        "Failed - Agent got blocked"
                    ],
                    [
                        "RouteScenario_18",
                        2,
                        "Failed - Agent got blocked"
                    ],
                    [
                        "RouteScenario_20",
                        4,
                        "Failed - Agent got blocked"
                    ],
                    [
                        "RouteScenario_21",
                        5,
                        "Failed - Agent got blocked"
                    ],
                    [
                        "RouteScenario_22",
                        6,
                        "Failed - Agent got blocked"
                    ],
                    [
                        "RouteScenario_23",
                        7,
                        "Failed - Agent got blocked"
                    ],
                    [
                        "RouteScenario_24",
                        8,
                        "Failed - Agent got blocked"
                    ],
                    [
                        "RouteScenario_25",
                        9,
                        "Failed - Agent got blocked"
                    ]
                ]
            },

I am wondering what the relationship is between Scenario and RouteScenario.

  2. I evaluated the pre-trained models and a model I trained myself, without any modification to your code. The results are:

[screenshot: Result of model_ckpt/geometric_fusion (pretrained)]

[screenshot: Result of model_ckpt/transfuser (pretrained)]

[screenshot: Result of transfuser (self-trained)]

I think something is wrong, because these results are significantly worse than the results in your paper. Can you spot any possible mistakes?

Evaluation script used for pre-trained transfuser model:

#!/bin/bash

export CARLA_ROOT=carla
export CARLA_SERVER=${CARLA_ROOT}/CarlaUE4.sh
export PYTHONPATH=$PYTHONPATH:${CARLA_ROOT}/PythonAPI
export PYTHONPATH=$PYTHONPATH:${CARLA_ROOT}/PythonAPI/carla
export PYTHONPATH=$PYTHONPATH:$CARLA_ROOT/PythonAPI/carla/dist/carla-0.9.10-py3.7-linux-x86_64.egg
export PYTHONPATH=$PYTHONPATH:leaderboard
export PYTHONPATH=$PYTHONPATH:leaderboard/team_code
export PYTHONPATH=$PYTHONPATH:scenario_runner

export LEADERBOARD_ROOT=leaderboard
export CHALLENGE_TRACK_CODENAME=SENSORS
export PORT=2000 # same as the carla server port
export TM_PORT=8000 # port for traffic manager, required when spawning multiple servers/clients
export DEBUG_CHALLENGE=0
export REPETITIONS=1 # multiple evaluation runs
export ROUTES=leaderboard/data/evaluation_routes/routes_town05_long.xml
export TEAM_AGENT=leaderboard/team_code/transfuser_agent.py
export TEAM_CONFIG=model_ckpt/transfuser
export CHECKPOINT_ENDPOINT=results/transfuser_result_1203_V2.json
export SCENARIOS=leaderboard/data/scenarios/town05_all_scenarios.json
export SAVE_PATH=data/expert_TF1203_V2 # path for saving episodes while evaluating
export RESUME=True
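# note: RECORD_PATH is passed to --record below but never set in this script,
# so the evaluator receives an empty value and recording stays disabled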

python3 ${LEADERBOARD_ROOT}/leaderboard/leaderboard_evaluator.py \
--scenarios=${SCENARIOS}  \
--routes=${ROUTES} \
--repetitions=${REPETITIONS} \
--track=${CHALLENGE_TRACK_CODENAME} \
--checkpoint=${CHECKPOINT_ENDPOINT} \
--agent=${TEAM_AGENT} \
--agent-config=${TEAM_CONFIG} \
--debug=${DEBUG_CHALLENGE} \
--record=${RECORD_PATH} \
--resume=${RESUME} \
--port=${PORT} \
--trafficManagerPort=${TM_PORT}
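
For reference, a minimal sketch (not from the repo) that tallies the failure reasons recorded in the results file written to CHECKPOINT_ENDPOINT; the exact nesting of the "meta" block can vary across leaderboard versions, so it searches for the exceptions list recursively:

import json
from collections import Counter

def find_exceptions(node):
    # walk the parsed JSON until an "exceptions" list like the one above is found
    if isinstance(node, dict):
        if "exceptions" in node:
            return node["exceptions"]
        for value in node.values():
            found = find_exceptions(value)
            if found:
                return found
    elif isinstance(node, list):
        for value in node:
            found = find_exceptions(value)
            if found:
                return found
    return None

with open("results/transfuser_result_1203_V2.json") as f:
    exceptions = find_exceptions(json.load(f)) or []

# each record is [route_name, route_index, failure_reason]
print(Counter(reason for _, _, reason in exceptions))
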
ap229997 commented 2 years ago
  1. RouteScenario is not related to Scenario in any way, so you can ignore that.
  2. Can you tell me which version of the code you are using (when did you clone/fork the repo)? I have updated the code a few times in between, so it's possible that the training & evaluation setup has changed. Also, can you tell me which dataset you used to train the transfuser model?
Co1lin commented 2 years ago

Thanks for your reply.

  1. I cloned this repo last month. After git pull, I can only see one change: leaderboard/data/additional_routes/sample_weather_route.xml.
  2. I used the minimal dataset. The total size of the data directory on my machine is 68G.
  3. I cannot get the same scores even with the pre-trained models, including the geometric fusion one and transfuser. That's confusing... Can you see anything abnormal in the evaluation outputs I posted above? For example, the agent got blocked and failed on all ten RouteScenarios. Besides, in results/sample_result.json, I can see RouteScenarios from 0 to 54, but there are only 16 to 25 in my outputs.
ap229997 commented 2 years ago

I had updated the transfuser checkpoint to the newer version (the one submitted to the leaderboard earlier, sometime in July). This newer version is trained on a different dataset containing multiple weathers, as opposed to the model in the paper, which is trained on clear weather data only. This could be a reason for the lower results, since routes_town05_long.xml contains only clear weather.

The geometric fusion checkpoint is the same one as the paper and it gives a driving score of around 20. This is within the variance reported in the paper (25 ± 4). You can also try running the evaluation multiple times to verify if there is a huge variance or not.

To reproduce the results in the paper, you should train the models on clear_weather_data.

The sample_result.json file is from a different evaluation route. I included it just to give an example of the formatting of the results file. You can ignore the scores and other things in this file.

Co1lin commented 2 years ago

Why is the "Avg. route completion" score for the geometric fusion checkpoint only around 20? In the paper it is around 70.

ap229997 commented 2 years ago

I'm not sure about that. I'd suggest running the evaluation multiple times to get an estimate of the variance. Since evaluation on the long routes takes quite some time, you could try reproducing the scores on routes_town05_short.xml, which should be much faster.
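
For instance, a quick way to summarize repeated runs (the numbers below are placeholders, not real results):

from statistics import mean, stdev

# driving scores from repeated evaluations of the same checkpoint
# (placeholder values, for illustration only)
driving_scores = [21.3, 18.9, 24.1]
print(f"DS = {mean(driving_scores):.1f} ± {stdev(driving_scores):.1f}")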

Co1lin commented 2 years ago

Ok, I get it. I will have a try. Thank you!

ap229997 commented 2 years ago

Which version of CARLA are you using? Also, are you using the leaderboard code from this repo or the official CARLA leaderboard repo?

Co1lin commented 2 years ago

We used ./setup_carla.sh, so I guess it is 0.9.10.1? FYI, after running ./CarlaUE4.sh --world-port=2000 -opengl, I saw this output:

4.24.3-0+++UE4+Release-4.24 518 0
Disabling core dumps.
ap229997 commented 2 years ago

That is fine.

Co1lin commented 2 years ago

Hi! I uploaded the pre-trained model to the CARLA leaderboard, but I got poor results.

Driving score: 10.336
Route completion: 15.480
Infraction penalty: 0.847
Collisions pedestrians: 0.000
Collisions vehicles: 0.840
Collisions layout: 0.450
Red light infractions: 0.542
Stop sign infractions: 0.000
Off-road infractions: 0.527
Route deviations: 0.000
Route timeouts: 0.011
Agent blocked: 31011.783

FYI, the leaderboard session lasted for 72 hours and 51 minutes. I built the Docker image after switching to the leaderboard_submission branch, and put the pre-trained model in the model_ckpt/transfuser directory.
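
As I understand the leaderboard metrics, each route gets a driving score equal to its route completion times its infraction penalty, and the global driving score is the mean of those per-route products, so it is not simply the product of the two global numbers above. A toy illustration with made-up values:

from statistics import mean

# made-up per-route results, for illustration only
routes = [
    {"completion": 40.0, "penalty": 0.5},
    {"completion": 10.0, "penalty": 1.0},
]

driving_score = mean(r["completion"] * r["penalty"] for r in routes)
print(driving_score)  # 15.0, while the product of the means would be 18.75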

Co1lin commented 2 years ago

I wonder whether, in the leaderboard session, the agent was tested under 14 weathers or only one. I think the leaderboard uses the same weather conditions and route scenarios for all submissions. Is that right?

Co1lin commented 2 years ago

When evaluating on my machine, I also noticed that the agent would stop before the intersection whether the traffic light was red or green. I think this is because the light is too small in the image, only a few pixels, which makes it quite hard for perception. Usually, the agent would follow another vehicle heading in the same direction to drive through the intersection. If there were no other vehicles, it just stopped. Have you noticed similar phenomena?
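
One way to check this is to crop and upscale the region of a saved camera frame where distant lights appear; a rough sketch (the crop box and the frame paths are guesses, not anything from the repo):

from PIL import Image

def zoom_light_region(frame_path, out_path, scale=4):
    # crop the upper-central part of the frame, where distant
    # traffic lights usually appear, and upscale it for inspection
    img = Image.open(frame_path)
    w, h = img.size
    box = (w // 3, 0, 2 * w // 3, h // 2)  # guessed region of interest
    crop = img.crop(box)
    crop = crop.resize((crop.width * scale, crop.height * scale), Image.BILINEAR)
    crop.save(out_path)

zoom_light_region("some_saved_frame.png", "light_zoom.png")  # hypothetical paths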

Kin-Zhang commented 2 years ago

I found that the new model may not move at all on some routes, like here; maybe that is one of the reasons.

[screenshot: 2021-12-17_20-28]

ap229997 commented 2 years ago

@Co1lin Sorry for the late reply. Earlier, when I updated the transfuser model definition and agent file for the new checkpoint in the main branch, I forgot to update them in the leaderboard_submission branch. I have updated them now, so you should be able to get better results.

The official CARLA leaderboard evaluates on a secret set of weathers and towns, which are kept hidden from the public for benchmarking purposes.

oskarnatan commented 2 years ago

Hi @ap229997, I also got poor evaluation results after retraining, validating, and testing all of your models. I would like to know how you trained, validated, and tested them. Did you train on Towns 1, 2, 3, 4, 6, 7, and 10 on long, short, and tiny routes, then validate & test on Town05 long, short, and tiny routes? Or did you exclude Town05's long routes from the validation set for testing (i.e., validate on Town05 short & tiny, then test on Town05 long)?

And a quick question about these model weights:

mkdir model_ckpt
wget https://s3.eu-central-1.amazonaws.com/avg-projects/transfuser/models.zip -P model_ckpt
unzip model_ckpt/models.zip -d model_ckpt/
rm model_ckpt/models.zip

Are all of them trained on the 14 kinds of weather data that can be downloaded with download_data.sh, or only the transfuser model? Thanks in advance.

ap229997 commented 2 years ago

Hi @oskarnatan, sorry for the late reply.

All the checkpoints except transfuser are trained on clear_weather_data. I agree that this is confusing since there are multiple datasets, but I'd suggest using the 14_weathers_data, since weathers are important on the official CARLA leaderboard. I apologize for the confusion; in our case, it was important to decouple the effect of weathers and scenarios, so we created multiple datasets.

For training, we use data collected from the tiny & short routes (not the long routes) of Towns 1, 2, 3, 4, 6, 7, and 10; we validate offline using data from Town05 short routes, and then test online on Town05 short and long routes.
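
In short (an illustrative snippet; the names are hypothetical, not identifiers from the repo):

# summary of the split described above
SPLIT = {
    "train_offline": {"towns": [1, 2, 3, 4, 6, 7, 10], "routes": ["tiny", "short"]},
    "val_offline":   {"towns": [5], "routes": ["short"]},
    "test_online":   {"towns": [5], "routes": ["short", "long"]},
}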

oskarnatan commented 2 years ago

@ap229997 , I see. Well noted, thank you.

xinshuoweng commented 2 years ago

Hey @ap229997 Aditya, thanks for your wonderful work! A follow-up question to your last note: is there any reason for excluding the long routes during training? For example, the Town 1, 2, 3, 4, 6, 7, 10 long routes are not used in validation or testing either. Does this imply that including these long routes during training decreases performance?

KleinYuan commented 2 years ago

Another follow-up question, which is also related to issue https://github.com/autonomousvision/transfuser/issues/12: I have retrained a model with the 14_weathers_data and run evaluations on routes_town05_long 3 times. However, the DS/RC variation is large:

[screenshot: evaluation results]

which seems quite different from what you reported (actually very close to the "Geometric Fusion Results"): 33.15/56.36. In your work, you mentioned that you trained 3 different models and ran 3 evaluations of each, yielding 9 results in total. So I can understand that my results differ from what you got in the paper.

However, I'm still a bit curious about why the variation is so large even across repeated tests. Do you think the results I got are normal?

ap229997 commented 2 years ago

> A follow-up question to your last note: is there any reason for excluding the long routes during training? [...]

Hi @xinshuoweng, sorry for the late response; I somehow missed this comment. We observed that including long routes heavily skewed the training data distribution (Fig. 1(a) of our supplementary), which led to a drop in performance.

For evaluation, we only consider Town05 long routes in this work, but other routes can also be used; we indeed used multi-town routes in another work of ours, NEAT.

ap229997 commented 2 years ago

@KleinYuan For the results in the paper, we used clear_weather_data, whereas for submitting to the leaderboard, we used 14_weathers_data. So you'd have to train on clear_weather_data to reproduce the results in the paper.

I think these results are fine. Given that the routes_town05_long evaluation consists of clear weather only, I'd expect a model trained only on clear weather to perform better than a model trained on multiple weathers.

Also, if you want evaluation in a multi-weather setting, we provide such a setting in another work of ours, NEAT, or you can directly submit to the leaderboard.