autonomousvision / transfuser

[PAMI'23] TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving; [CVPR'21] Multi-Modal Fusion Transformer for End-to-End Autonomous Driving
MIT License

Bad evaluation results after training #180

Closed: st3lzer closed this issue 11 months ago

st3lzer commented 11 months ago

Hello! After we discussed my evaluation setup in #176, I trained three models on your provided dataset with the following arguments (which should be the same as in your publication):

"id": "transfuser", "epochs": 41, "lr": 0.0001, "batch_size": 12, "logdir": "/beegfs/work/stelz3r/SyncTransfuser/log/transfuser", "load_file": null, "start_epoch": 0, "setting": "all", "root_dir": "/beegfs/work/stelz3r/SyncTransfuser/transfuser_data/", "schedule": 1, "schedule_reduce_epoch_01": 30, "schedule_reduce_epoch_02": 40, "backbone": "transFuser", "image_architecture": "regnety_032", "lidar_architecture": "regnety_032", "use_velocity": 0, "n_layer": 4, "wp_only": 0, "use_target_point_image": 1, "use_point_pillars": 0, "parallel_training": 1, "val_every": 1, "no_bev_loss": 0, "sync_batch_norm": 0, "zero_redundancy_optimizer": 0, "use_disk_cache": 0

As you suggested, I used epoch 31 and evaluated the three epoch-31 models together. Unfortunately, the results are not very good:

| Metric | Value |
| --- | --- |
| Avg. driving score | 30.549 |
| Avg. route completion | 69.244 |
| Avg. infraction penalty | 0.448 |
| Collisions with pedestrians | 0.025 |
| Collisions with vehicles | 2.839 |
| Collisions with layout | 0.177 |
| Red lights infractions | 0.051 |
| Stop sign infractions | 0.406 |
| Off-road infractions | 0.488 |
| Route deviations | 0.000 |
| Route timeouts | 0.076 |
| Agent blocked | 0.507 |
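For reference, the CARLA leaderboard computes the driving score per route as route completion times the infraction penalty and then averages over routes, so the numbers above are roughly consistent with each other. A quick sanity check of my own, assuming that standard definition:

```python
# Quick sanity check of the numbers above, assuming the standard CARLA leaderboard
# definition: per-route driving score = route completion * infraction penalty,
# then averaged over routes.
avg_rc = 69.244   # average route completion (%)
avg_ip = 0.448    # average infraction penalty (1.0 = no infractions)

# ~31.0, close to the reported 30.549; the small gap comes from averaging the
# per-route products rather than multiplying the averages.
print(avg_rc * avg_ip)
```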

As in #176, the agent-blocked metric is quite high and the route completion is low, but the infraction penalty is also very low. Do you have any ideas or suggestions about what could be wrong?
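(For context: by "evaluated together" I mean that the evaluation agent is given all three checkpoints and ensembles their predictions. Conceptually this amounts to something like the sketch below; the function and factory names are placeholders, not the repo's actual agent code.)

```python
# Hypothetical sketch of ensembling three trained checkpoints at evaluation time:
# load every checkpoint and average the predicted waypoints. The names below are
# placeholders, not the repo's actual agent code.
import torch


def load_ensemble(checkpoint_paths, make_model):
    models = []
    for path in checkpoint_paths:
        model = make_model()  # placeholder factory that builds one TransFuser network
        model.load_state_dict(torch.load(path, map_location="cpu"))
        model.eval()
        models.append(model)
    return models


@torch.no_grad()
def predict_waypoints(models, batch):
    # Average the waypoint predictions of all ensemble members.
    predictions = [model(batch) for model in models]
    return torch.stack(predictions).mean(dim=0)
```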

Kait0 commented 11 months ago

Hm, the training settings seem right. How many GPUs did you use for training? The batch size of 12 is meant for training jobs that use 8 GPUs; if you use fewer, you need to increase the batch size proportionally.
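Put differently, the effective batch size is the per-GPU batch size times the number of GPUs, and it should stay roughly constant when you change the GPU count. A quick illustration (not code from the repo):

```python
# Effective batch size = per-GPU batch size * number of GPUs (illustration only,
# not code from the repo).
def per_gpu_batch_size(target_effective, n_gpus):
    assert target_effective % n_gpus == 0, "pick a batch size divisible by the GPU count"
    return target_effective // n_gpus

reference_effective = 12 * 8                       # paper setup: batch 12 on 8 GPUs -> 96
print(per_gpu_batch_size(reference_effective, 8))  # 12, the value in the config above
print(per_gpu_batch_size(reference_effective, 2))  # 48, what 2 GPUs would need
```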

Otherwise it is hard to tell. Maybe you can generate some debug videos to see what exactly is going wrong (for example, what kind of blocks you get).

st3lzer commented 11 months ago

For training I am using 2 A100 GPUs, so I will try a batch size of 48. Can that have such an influence on the training outcome?

Yesterday I started another evaluation with the three epoch-41 checkpoints, which seems to perform slightly better.

I will try to create some debug videos tomorrow. Would the JSON file from the evaluation also help?

Kait0 commented 11 months ago

> Can that have such an influence on the training outcome?

Even just the random seed can have such an influence on the training outcome. That said, longest6 is a bit more stable than other benchmarks because it has quite diverse routes and tests in distribution. I think the largest seed-related difference we observed on longest6 was 12 DS, for geometric fusion (Table 6). (On the LAV benchmark, for example, I observed up to a 30 DS difference due to the training seed.)

So your 16 DS difference suggests there might be something wrong. You trained with an effective batch size of 24, so that could be the issue (people usually use values between 64 and 512 for supervised training); we will see.

st3lzer commented 11 months ago

Then I will just run three trainings again with a batch size of 48 (effectively 96) and evaluate them.

One question, just to be sure: to reproduce your results (of course with slight deviations), I just do three training runs, take the epoch-31 checkpoint of each, and evaluate them together? I do not have to set a random seed manually for every training run?

Kait0 commented 11 months ago

Yes, we don't set the seed explicitly. The code will start from a different seed automatically every time you run it (I think the seed is usually based on the seconds since 1970).
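(For illustration, deriving a seed from the current Unix time looks roughly like this; the exact call in the repo may differ.)

```python
# Rough illustration of time-based seeding; the exact call in the repo may differ.
import random
import time

import numpy as np
import torch

seed = int(time.time())  # seconds since 1970-01-01 (the Unix epoch)
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
```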

st3lzer commented 11 months ago

Thank you for the advice! I will see the results next week, and hopefully they will be right this time.