wljungbergh opened this issue 9 months ago
Hi @wljungbergh, thank you so much for sharing this valuable information. Could you please make a pull request for the code change? I'd like to present your verification in the README of this project for clarification, especially the table showing the performance discrepancy. Your efforts are greatly appreciated. Thanks again.
No worries, glad to help. I'd be happy to make a pull request, but I'll take this up again sometime after March 7th, for obvious reasons. :writing_hand: :page_facing_up:
Alright! Now I have some time on my hands. Do you want me to change the evaluation to match that of ST-P3 and VAD, or do you want it to be configurable?
Also, where would be the most appropriate place to present the results: the TRAIN_EVAL.md file or the main README?
Hi @wljungbergh! I prefer it to be configurable. How about distinguishing the two evaluation protocols by paper name? One could specify the evaluation strategy in the config with `evaluation_strategy='stp3'` or `evaluation_strategy='uniad'`, with `'uniad'` as the default.
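Something along these lines, perhaps (just a sketch; the option name and where it hooks into the metric code are still open):

```python
# In the eval config (option name illustrative, not a final API):
evaluation_strategy = 'stp3'  # or 'uniad' (the default)

# In the planning-metric code, dispatch on the chosen protocol:
def aggregate(per_step_values, horizon_steps, strategy='uniad'):
    """Reduce per-timestep values (L2 errors or collision flags) to one number."""
    if strategy == 'stp3':
        # Cumulative: mean over all timesteps up to the horizon.
        return sum(per_step_values[:horizon_steps]) / horizon_steps
    if strategy == 'uniad':
        # Non-cumulative: the value at the exact timestep.
        return per_step_values[horizon_steps - 1]
    raise ValueError(f"unknown evaluation_strategy: {strategy!r}")
```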
I will update the README and other instruction files accordingly. No need to worry about it.
Also, I want to know whether the misalignment issue also happens with the collision rate metric. Is the `stp3` metric cumulative while the `uniad` one is non-cumulative? Have you looked into that before?
Yes, they do the same for both L2 and collision rate (i.e., cumulative). As you can see here, they create one planning metric for each of the timesteps they use and then take the mean over the values they have accumulated (see here). You compute this non-cumulatively, just as with your L2 definition.
And yes, if we change that definition too, UniAD achieves a lower CR than the other works.
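To make the two protocols concrete with invented numbers (the flags below are made up purely for illustration; at 2 Hz there are two steps per second):

```python
import numpy as np

# Invented collision flags for 6 future timesteps (2 Hz over 3 s);
# True means the planned pose collides at that step.
flags = np.array([False, False, True, False, False, True])

for sec in (1, 2, 3):
    steps = 2 * sec
    uniad = float(flags[steps - 1])  # non-cumulative: flag at the exact step
    stp3 = flags[:steps].mean()      # cumulative: mean of flags up to the horizon
    print(f"{sec}s  uniad={uniad:.2f}  stp3={stp3:.2f}")
```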
Understood. After modifying the code, could you please put together a complete table with both the L2 and CR metrics under the `stp3` and `uniad` strategies, like the one you posted? It would be of great help if the table included L2 and CR results for ST-P3, UniAD, and VAD under the two evaluation strategies.
Added PR #173. Cross-posting the table here:
Method | L2 1s (m) | L2 2s (m) | L2 3s (m) | CR 1s (%) | CR 2s (%) | CR 3s (%) |
---|---|---|---|---|---|---|
ST-P3 (legacy ST-P3 metric) | 1.33 | 2.11 | 2.90 | 0.23 | 0.62 | 1.27 |
UniAD (your current metric) | 0.51 | 0.98 | 1.65 | 0.10 | 0.15 | 0.61 |
UniAD (legacy ST-P3 metric) | 0.42 | 0.64 | 0.91 | 0.07 | 0.10 | 0.22 |
VAD-Tiny (legacy ST-P3 metric) | 0.46 | 0.76 | 1.12 | 0.21 | 0.35 | 0.58 |
VAD-Base (legacy ST-P3 metric) | 0.41 | 0.70 | 1.05 | 0.07 | 0.17 | 0.41 |
Great! Thanks for your help.
Hi,
As mentioned here, you are not using the same definition of the L2 metric as other works (e.g., VAD or ST-P3).
However, you present UniAD's numbers computed with your new metric against ST-P3's numbers computed with the old metric. This caused us (and surely many others) to assume that UniAD uses the same definition as ST-P3 (VAD, for example, makes this assumption when comparing to your method; see here).
Unfortunately, this greatly underrepresents UniAD's performance. To reduce confusion, it would help greatly if you presented your numbers under both definitions.
We ran the numbers for you and, not so surprisingly, your scores improved a lot, especially at the longer horizons. We include VAD's numbers to highlight that the definition makes a big difference when comparing methods against each other.
FYI, we simply changed the code from taking the error at the exact timestep to averaging the per-timestep errors over all timesteps up to each horizon (ST-P3's cumulative definition).
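In sketch form (the array values are invented purely for illustration):

```python
import numpy as np

# Toy per-timestep L2 errors at 2 Hz over a 3 s horizon (values invented).
per_step_l2 = np.array([0.2, 0.4, 0.7, 1.0, 1.4, 1.9])
steps = 6  # 3 s horizon

# Before -- UniAD's definition: the error at the exact timestep.
l2_uniad = per_step_l2[steps - 1]     # 1.9

# After -- ST-P3's definition: the mean over all timesteps up to the horizon.
l2_stp3 = per_step_l2[:steps].mean()  # ~0.93
```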