wljungbergh opened this issue 9 months ago
Hi @wljungbergh, thank you so much for sharing this valuable information. Could you please make a pull request for the code change? I'd like to present your verification in the README of this project for clarification, especially the table showing the performance discrepancy. Your efforts are greatly appreciated. Thanks again.
No worries, glad to help. I'd be happy to make a pull request, but I'll take this up again sometime after March 7th, for obvious reasons. :writing_hand: :page_facing_up:
Alright! Now I have some time on my hands. Do you want me to change the evaluation to match that of ST-P3 and VAD, or do you want it to be configurable?
Also, where would be the most appropriate place to present the results: the TRAIN_EVAL.md file or the main README?
Hi @wljungbergh! I prefer it to be configurable. How about distinguishing the two evaluation protocols by paper name? One could specify the evaluation strategy in the config with `evaluation_strategy='stp3'` or `evaluation_strategy='uniad'`, with `'uniad'` as the default.
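Something along these lines, perhaps (just a sketch; the option name and where it hooks into the metric code are still open):

```python
# In the eval config (option name illustrative, not a final API):
evaluation_strategy = 'stp3'  # or 'uniad' (the default)

# In the planning-metric code, dispatch on the chosen protocol:
def aggregate(per_step_values, horizon_steps, strategy='uniad'):
    """Reduce per-timestep values (L2 errors or collision flags) to one number."""
    if strategy == 'stp3':
        # Cumulative: mean over all timesteps up to the horizon.
        return sum(per_step_values[:horizon_steps]) / horizon_steps
    if strategy == 'uniad':
        # Non-cumulative: the value at the exact timestep.
        return per_step_values[horizon_steps - 1]
    raise ValueError(f"unknown evaluation_strategy: {strategy!r}")
```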
I will update the README and other instruction files accordingly. No need to worry about it.
Also, I want to know whether the misalignment issue also happens with the collision rate metric. Is the `stp3` metric cumulative while the `uniad` one is non-cumulative? Have you looked into that before?
Yes, they do the same for both L2 and collision rate (i.e., cumulative). As you can see here, they create one planning metric for each of the timesteps they use and then take the mean over the values they have accumulated (see here). You compute this non-cumulatively, just as with your L2 definition.
And yes, if we change that definition too, UniAD achieves a lower CR than the other works.
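To make the two protocols concrete with invented numbers (the flags below are made up purely for illustration; at 2 Hz there are two steps per second):

```python
import numpy as np

# Invented collision flags for 6 future timesteps (2 Hz over 3 s);
# True means the planned pose collides at that step.
flags = np.array([False, False, True, False, False, True])

for sec in (1, 2, 3):
    steps = 2 * sec
    uniad = float(flags[steps - 1])  # non-cumulative: flag at the exact step
    stp3 = flags[:steps].mean()      # cumulative: mean of flags up to the horizon
    print(f"{sec}s  uniad={uniad:.2f}  stp3={stp3:.2f}")
```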
Understood. After modifying the code, could you please put together a complete table with both the L2 and CR metrics under the `stp3` and `uniad` strategies, like the one you posted? It would be of great help if the table included L2 and CR results for ST-P3, UniAD, and VAD under the two evaluation strategies.
Added PR #173. Cross-posting the table here:
Method | L2 1s (m) | L2 2s (m) | L2 3s (m) | CR 1s (%) | CR 2s (%) | CR 3s (%) |
---|---|---|---|---|---|---|
ST-P3 (legacy ST-P3 metric) | 1.33 | 2.11 | 2.90 | 0.23 | 0.62 | 1.27 |
UniAD (your current metric) | 0.51 | 0.98 | 1.65 | 0.10 | 0.15 | 0.61 |
UniAD (legacy ST-P3 metric) | 0.42 | 0.64 | 0.91 | 0.07 | 0.10 | 0.22 |
VAD-Tiny (legacy ST-P3 metric) | 0.46 | 0.76 | 1.12 | 0.21 | 0.35 | 0.58 |
VAD-Base (legacy ST-P3 metric) | 0.41 | 0.70 | 1.05 | 0.07 | 0.17 | 0.41 |
Great! Thanks for your help.
Hi,
As mentioned here, you are not using the same definition of the L2 metric as other works (e.g., VAD or ST-P3).
However, you present UniAD's numbers computed with your new metric against ST-P3's numbers computed with the old metric. This caused us (and surely many others) to assume that UniAD uses the same definition as ST-P3 (VAD, for example, makes this assumption when comparing to your method; see here).
Unfortunately, this greatly underrepresents UniAD's performance. To reduce confusion, it would help greatly if you presented your numbers under both definitions.
We ran the numbers for you and, not so surprisingly, your scores improved a lot, especially at the longer horizons. We include VAD's numbers to highlight that the definition makes a big difference when comparing methods against each other.
FYI, we simply changed the code from taking the error at the exact timestep to averaging the per-timestep errors over all timesteps up to each horizon (ST-P3's cumulative definition).
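In sketch form (the array values are invented purely for illustration):

```python
import numpy as np

# Toy per-timestep L2 errors at 2 Hz over a 3 s horizon (values invented).
per_step_l2 = np.array([0.2, 0.4, 0.7, 1.0, 1.4, 1.9])
steps = 6  # 3 s horizon

# Before -- UniAD's definition: the error at the exact timestep.
l2_uniad = per_step_l2[steps - 1]     # 1.9

# After -- ST-P3's definition: the mean over all timesteps up to the horizon.
l2_stp3 = per_step_l2[:steps].mean()  # ~0.93
```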