autonomousvision / carla_garage

[ICCV'23] Hidden Biases of End-to-End Driving Models
MIT License

Some questions about performance gap #11

Closed Naive-Bayes closed 1 year ago

Naive-Bayes commented 1 year ago

Hi Bernhard Jaeger, thanks for releasing the models (.pth files) and datasets. I think this repo can serve as a baseline for fair comparison.

I want to run the TF++ method on the Longest6 benchmark, and the expected result is DS: 72, RC: 95, IS: 0.74 according to Table 6 in the paper.

So I used the .pth file from 'pretrained_models/longest6/tfpp_all_0' to run the evaluation, and the result is DS: 57.58, RC: 85.25, IS: 0.694. When using the .pth file from 'pretrained_models/longest6/tfpp_all_1', the result is DS: 62.68, RC: 90.17, IS: 0.7.

These results are all outside the std in Table 6. Could you tell me some details about how I can get the correct result on Longest6?

Naive-Bayes commented 1 year ago

Here is my `local_evaluation.sh`:

```bash
export CARLA_ROOT=${1:-/home/shannon/Carla_0.9.10.1}
export WORK_DIR=${2:-/home/shannon/carla_garage}

export CARLA_SERVER=${CARLA_ROOT}/CarlaUE4.sh
export PYTHONPATH=$PYTHONPATH:${CARLA_ROOT}/PythonAPI
export PYTHONPATH=$PYTHONPATH:${CARLA_ROOT}/PythonAPI/carla
export PYTHONPATH=$PYTHONPATH:$CARLA_ROOT/PythonAPI/carla/dist/carla-0.9.10-py3.7-linux-x86_64.egg
export SCENARIO_RUNNER_ROOT=${WORK_DIR}/scenario_runner
export LEADERBOARD_ROOT=${WORK_DIR}/leaderboard
export PYTHONPATH="${CARLA_ROOT}/PythonAPI/carla/":"${SCENARIO_RUNNER_ROOT}":"${LEADERBOARD_ROOT}":${PYTHONPATH}

export SCENARIOS=${WORK_DIR}/leaderboard/data/scenarios/eval_scenarios.json
export ROUTES=${WORK_DIR}/leaderboard/data/longest6.xml
export REPETITIONS=1
export CHALLENGE_TRACK_CODENAME=SENSORS
export CHECKPOINT_ENDPOINT=${WORK_DIR}/results/transfuser_plus_plus_longest6.json
export TEAM_AGENT=${WORK_DIR}/team_code/sensor_agent.py
export TEAM_CONFIG=${WORK_DIR}/pretrained_models/longest6/tfpp_all_0
export DEBUG_CHALLENGE=0
export RESUME=1
export DATAGEN=0
export SAVE_PATH=${WORK_DIR}/results
export UNCERTAINTY_THRESHOLD=0.33
export PORT=2020
export TM_PORT=8015

python3 ${LEADERBOARD_ROOT}/leaderboard/leaderboard_evaluator_local.py \
  --scenarios=${SCENARIOS} \
  --routes=${ROUTES} \
  --repetitions=${REPETITIONS} \
  --track=${CHALLENGE_TRACK_CODENAME} \
  --checkpoint=${CHECKPOINT_ENDPOINT} \
  --agent=${TEAM_AGENT} \
  --agent-config=${TEAM_CONFIG} \
  --debug=0 \
  --resume=${RESUME} \
  --timeout=600 \
  --port=${PORT} \
  --trafficManagerPort=${TM_PORT}
```

I made minor modifications to the original 'local_evaluation.sh', but I don't think these modifications should affect the result.

Kait0 commented 1 year ago

Can you share the full statistics? Your parsed results.csv, maybe. Your route completion seems too low; I would need to look at the details to see what exactly reduced it.
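
If it helps, here is a rough sketch of how you could aggregate the per-route entries from the checkpoint JSON yourself; the key names follow the standard CARLA leaderboard output, so double-check them against your file:

```python
import json
from statistics import mean

# Path from the CHECKPOINT_ENDPOINT in your script; adjust if needed.
RESULTS = "results/transfuser_plus_plus_longest6.json"

with open(RESULTS) as f:
    data = json.load(f)

records = data["_checkpoint"]["records"]  # one entry per evaluated route

# Per-route driving score and route completion
# (key names as in the standard leaderboard JSON; verify against your file).
ds = [r["scores"]["score_composed"] for r in records]
rc = [r["scores"]["score_route"] for r in records]
print(f"routes: {len(records)}  mean DS: {mean(ds):.2f}  mean RC: {mean(rc):.2f}")

# Count blocked and timeout events, to see what is pulling RC down.
blocked = sum(len(r["infractions"].get("vehicle_blocked", [])) for r in records)
timeouts = sum(len(r["infractions"].get("route_timeout", [])) for r in records)
print(f"blocked: {blocked}  timeouts: {timeouts}")
```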

Kait0 commented 1 year ago

hm, the blocked metric was a bit higher (0.19 instead of the average 0.06); the other numbers looked ok. In absolute numbers this is not much, maybe 5 blocked infractions or so. One thing that stood out to me is that your evaluation uses only a single repetition.

The numbers in Table 6 are an average of 9 runs (3 evaluation repetitions × 3 different training seeds). The std is the training std. In your case, I would first complete one of your evaluation runs for model 0 and see what average result you get. Sometimes in CARLA these kinds of problems are just variance, and you waste time looking for issues that don't exist.
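
For reference, the averaging works roughly like this (the driving scores below are placeholders, not our raw per-run values):

```python
from statistics import mean, stdev

# Placeholder driving scores: 3 evaluation repetitions for each of the
# 3 training seeds (9 runs total). Replace with your own numbers.
runs = {
    "seed_0": [71.0, 73.5, 72.2],
    "seed_1": [70.1, 74.0, 71.8],
    "seed_2": [73.2, 72.5, 70.9],
}

# Table-6-style number: mean over all 9 runs.
all_runs = [ds for seed_runs in runs.values() for ds in seed_runs]
print(f"mean DS over 9 runs: {mean(all_runs):.1f}")

# The reported std is the training std: std over the per-seed means.
seed_means = [mean(v) for v in runs.values()]
print(f"training std: {stdev(seed_means):.2f}")
```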

Other than that, make sure you run CARLA with the -opengl option and don't use 30XX GPUs; they produce rendering artifacts.

Naive-Bayes commented 1 year ago

> hm, the blocked metric was a bit higher (0.19 instead of the average 0.06); the other numbers looked ok. In absolute numbers this is not much, maybe 5 blocked infractions or so. One thing that stood out to me is that your evaluation uses only a single repetition.
>
> The numbers in Table 6 are an average of 9 runs (3 evaluation repetitions × 3 different training seeds). The std is the training std. In your case, I would first complete one of your evaluation runs for model 0 and see what average result you get. Sometimes in CARLA these kinds of problems are just variance, and you waste time looking for issues that don't exist.
>
> Other than that, make sure you run CARLA with the -opengl option and don't use 30XX GPUs; they produce rendering artifacts.

OK, I will run model 0 for 3 repetitions and check the result. (Actually, I have already run it several times with repetition 1, and the RC is still low.)

And I have run CARLA with '-opengl', and my GPU is a 1080ti, not a 30xx.

Kait0 commented 1 year ago

Tried to reproduce your numbers by rerunning the released code with tfpp_all_0, but I got 74 DS (70.45, 74.78, 76.05 for the individual repetitions). I used the garage conda environment and evaluate_routes_slurm.py (with UNCERTAINTY_THRESHOLD=0.33, STOP_CONTROL=1, and the paths set), and CARLA version 0.9.10.1. My GPUs were 2080ti. Not sure what else could go wrong.

Naive-Bayes commented 1 year ago

I also use the garage conda environment and CARLA 0.9.10.1, but I run the evaluation with `local_evaluation.sh`. The difference between `local_evaluation.sh` and `evaluate_routes_slurm.py` is, I think, the following environment variables:

```bash
export DIRECT=1
export UNCERTAINTY_WEIGHT=1
export UNCERTAINTY_THRESHOLD=0.33
export HISTOGRAM=0
export BLOCKED_THRESHOLD=180
export TMP_VISU=0
export VISU_PLANT=0
export SLOWER=1
export STOP_CONTROL=1
export TP_STATS=0
export BENCHMARK=longest6
```

Could these explicit environment variables cause the lower result? I will try to add them and rerun `local_evaluation.sh`.

Kait0 commented 1 year ago

Most of these values are set to the same value by default in the code. Actually, the BENCHMARK value is not; that one was missing in the example local_evaluation.sh. Will add it.
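
Schematically, the agent picks these up from the environment with a code-side default, so only variables whose default differs from the evaluation setting matter. This is not the actual repo code, and the default shown for BENCHMARK is made up purely to illustrate the point:

```python
import os

# Exported variable overrides the code-side default; otherwise the default
# is used silently (schematic pattern, not the repo's exact defaults).
uncertainty_threshold = float(os.environ.get("UNCERTAINTY_THRESHOLD", "0.33"))  # same as the script -> no effect
stop_control = int(os.environ.get("STOP_CONTROL", "1"))                         # same as the script -> no effect
benchmark = os.environ.get("BENCHMARK", "lav")  # hypothetical default; if it differs from
                                                # "longest6", forgetting the export changes behavior
print(uncertainty_threshold, stop_control, benchmark)
```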

Kait0 commented 1 year ago

any update?

Naive-Bayes commented 1 year ago

Sorry for the late reply, I had to work on other things for the last two weeks. I found that adding the os.environ variables mentioned above helps. After re-running the evaluation I get:

| DS | RC | IS |
| --- | --- | --- |
| 68.730 | 89.565 | 0.775 |
| 68.266 | 88.784 | 0.776 |
| 69.826 | 88.029 | 0.775 |
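
Averaging the three repetitions as a quick check (nothing repo-specific):

```python
from statistics import mean

# The three evaluation repetitions reported above.
ds = [68.730, 68.266, 69.826]
rc = [89.565, 88.784, 88.029]
is_ = [0.775, 0.776, 0.775]

print(f"mean DS: {mean(ds):.2f}  mean RC: {mean(rc):.2f}  mean IS: {mean(is_):.3f}")
# -> roughly DS 68.9, RC 88.8, IS 0.775, versus DS 72 / RC 95 / IS 0.74 in Table 6
```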

Although there is still a small performance gap, I think a gap of about 3 std is tolerable and we can regard the result as reproducible.

Thanks for your help!