The V2 version was tested under a setting similar to ST-P3 (average over average, but the ground-truth occupancy map is different; the code is mainly from here). In V3 we adopted a more standard evaluation that has been tested and aligned with the paper-reported scores of UniAD and ST-P3. Therefore, please refer to the latest version, V3, for results.
For more context, UniAD and ST-P3 use different evaluation metrics:
Validation samples are different: UniAD evaluates 6019 frames with masking, while ST-P3 drops the first and last several frames of each log (around 4800 test frames). This has only a marginal effect on performance.
GT occupancy: UniAD considers only vehicles in the collision calculation, while ST-P3 includes both vehicles and pedestrians, and the vehicle counts also differ between the two implementations. This has a non-negligible impact on the collision metric.
Average over average: UniAD computes L2 and collision at each timestep independently, while ST-P3 reports the average over all previous timesteps. This causes a significant difference in all results.
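To make the averaging difference concrete, here is a minimal sketch (with hypothetical error values, not from either codebase) contrasting the two schemes:

```python
import numpy as np

# Hypothetical per-timestep L2 errors (meters) over a 3-step horizon.
l2_per_step = np.array([0.5, 1.0, 1.8])

# UniAD-style: the metric at each horizon step is reported directly.
uniad_style = l2_per_step

# ST-P3-style ("average over average"): at step t, report the mean
# of the errors over steps 1..t, i.e. a cumulative average.
stp3_style = np.cumsum(l2_per_step) / np.arange(1, len(l2_per_step) + 1)

print(uniad_style)  # [0.5 1.  1.8]
print(stp3_style)   # [0.5  0.75 1.1 ]
```

Since later steps usually have larger errors, the cumulative average at the final step is lower than the per-step value, so the two schemes are not directly comparable.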
We also release the gt_occ data and the evaluation code; the reported results should be reproducible with the provided code. If you'd like to compare, please use our code for evaluation.