Hilti-Research / hilti-slam-challenge-2022


Dependency of trajectory evaluation on the presence of first N possible poses #16

Closed YoshuaNava closed 1 year ago

YoshuaNava commented 2 years ago

Hi, I just submitted two trajectories for the same sequence with a slight delay in initialization between them, and obtained very different results. I'm aware that the competition expects accurate and dense trajectories, but I am unsure whether this is an expectation on the trajectory characteristics or a requirement for the evaluation.

Therefore, I wanted to ask whether the evaluation method for the challenge expects the first N poses from the start of the bag to be present, or whether it rather aligns whatever trajectory is available to the ground truth.

Thank you in advance.

YoshuaNava commented 2 years ago

Our team implemented a live SLAM solution for the challenge, and during our sprint we obtained results where the sensor would constantly revisit the control points with a very low tracking error. A priori, we had achieved very good APE, but our trajectories were discontinuous (comparatively worse RPE). After submitting our trajectories to the server, the 'origin', as aligned by the evaluation script, regularly had an error >0. We tried to debug this by checking the calibrations, making sure to compute an estimate for every LiDAR point cloud, improving the smoothness of the pose estimates, increasing and decreasing our estimation rates, adding more constraints to the problem (e.g. to prevent our sensor modalities from conflicting with each other), improving our estimation methods, etc.

All of the above helped to improve our scores, as reported by the Hilti server. Nonetheless, the error at the origin still remained >0, which baffled us.

To understand this better, I took a trajectory that scored 90 points and split it into two parts. The server scored the first part 57 and the second 63, for a combined 120 points, higher than the full-trajectory score. I assume this happened because our trajectory at the time had a slight tilt variation midway through the dataset.
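Here is a rough sketch of what I mean, using evo's Python API and assuming TUM-format trajectory files; the file names and the split time are hypothetical, and this is not the server's actual scoring code:

```python
# Rough sketch of the split experiment with evo's Python API (not the server's
# scoring code). File names and the split time below are hypothetical.
import copy

import numpy as np
from evo.core import metrics, sync
from evo.tools import file_interface

traj_ref = file_interface.read_tum_trajectory_file("ground_truth.txt")
traj_est = file_interface.read_tum_trajectory_file("estimate.txt")


def umeyama_ape_rmse(ref, est):
    """Associate by timestamp, Umeyama-align, and return translational APE RMSE."""
    ref_sync, est_sync = sync.associate_trajectories(ref, est)
    est_aligned = copy.deepcopy(est_sync)
    est_aligned.align(ref_sync, correct_scale=False)  # least-squares SE(3) fit
    ape = metrics.APE(metrics.PoseRelation.translation_part)
    ape.process_data((ref_sync, est_aligned))
    return ape.get_statistic(metrics.StatisticsType.rmse)


# Split the estimate into two halves and evaluate each half separately.
t_split = traj_est.timestamps[0] + 120.0  # hypothetical split time [s]
first = copy.deepcopy(traj_est)
second = copy.deepcopy(traj_est)
first.reduce_to_ids(np.where(traj_est.timestamps <= t_split)[0])
second.reduce_to_ids(np.where(traj_est.timestamps > t_split)[0])

print("full  :", umeyama_ape_rmse(traj_ref, traj_est))
print("first :", umeyama_ape_rmse(traj_ref, first))
print("second:", umeyama_ape_rmse(traj_ref, second))
```

Each half gets its own Umeyama fit, so a tilt change midway through the run can be partially absorbed per segment, which would be consistent with the two halves scoring better than the whole trajectory.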

Thank you in advance.

IamPhytan commented 2 years ago

+1

We found out that reducing the error on outlier points reduces the score, simply because of the Umeyama alignment.
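A synthetic toy example of the mechanism (purely made-up data, not challenge data, using evo's `umeyama_alignment` helper): the fit is a least-squares over all matched positions, so touching a few outlier points moves the alignment, and therefore the errors, everywhere else too.

```python
# Toy illustration (synthetic data): the Umeyama fit is a least-squares over
# all matched positions, so changing a few outlier points shifts the alignment
# and therefore the error at every other point as well.
import numpy as np
from evo.core import geometry

rng = np.random.default_rng(0)
ref = rng.uniform(-10.0, 10.0, size=(3, 50))        # reference positions, 3 x N
est = ref + rng.normal(0.0, 0.02, size=ref.shape)   # estimate with small noise
est[:, -1] += 2.0                                    # one gross outlier


def aligned_errors(source, target):
    """Umeyama-align source onto target and return per-point position errors."""
    r, t, c = geometry.umeyama_alignment(source, target, with_scale=False)
    return np.linalg.norm((c * r @ source + t[:, None]) - target, axis=0)


err_before = aligned_errors(est, ref)
est_fixed = est.copy()
est_fixed[:, -1] = ref[:, -1]                        # "fix" only the outlier
err_after = aligned_errors(est_fixed, ref)

# The errors at the untouched points change as well, because the fit moved.
print(err_before[:-1].mean(), err_after[:-1].mean())
```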

I recommend looking at evo, which is most probably the Python library used in the backend to generate the ATE graphs.

YoshuaNava commented 2 years ago

@IamPhytan thanks a lot.

I'm very familiar with EVO, but my questions go beyond it. EVO is just an evaluation tool; I'm asking, for example, why Umeyama alignment was chosen instead of aligning origins (both options are supported by EVO out of the box).
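For reference, this is roughly how the two options look side by side in evo's Python API (just a sketch; the file names are hypothetical):

```python
# Sketch of the two alignment options in evo's Python API. File names below
# are hypothetical placeholders for a TUM-format ground truth and estimate.
import copy

from evo.core import sync
from evo.tools import file_interface

traj_ref = file_interface.read_tum_trajectory_file("ground_truth.txt")
traj_est = file_interface.read_tum_trajectory_file("estimate.txt")
traj_ref, traj_est = sync.associate_trajectories(traj_ref, traj_est)

# Option 1: Umeyama / least-squares SE(3) alignment over all matched poses
# (what the evo CLI exposes as -a / --align).
est_umeyama = copy.deepcopy(traj_est)
est_umeyama.align(traj_ref, correct_scale=False)

# Option 2: only snap the first pose of the estimate onto the reference
# (what the evo CLI exposes as --align_origin).
est_origin = copy.deepcopy(traj_est)
est_origin.align_origin(traj_ref)
```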

bedaberner commented 2 years ago

> I'm aware that the trajectories are aligned in orientation (Umeyama / least squares) against a reference, and from what I understood, once aligned you compute the pose of the control points and compare them against their real values. With this approach you demand local and global consistency of the trajectory simultaneously, leading to a more complete evaluation than other datasets/challenges. **First question**: is this how the evaluation actually works?

Yes, pretty much. You can look at the evaluation script of the 2021 challenge, which worked basically the same way:

https://github.com/Hilti-Research/Hilti-SLAM-Challenge-2021/blob/master/evaluation-evo/evaluation.py
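In case it helps others, here is a condensed sketch of that idea (not the actual server code): Umeyama-align the estimate against the sparse ground truth, then look at the position error at each control-point timestamp. The file names are hypothetical.

```python
# Condensed sketch of the evaluation idea (not the actual server code):
# Umeyama-align the estimate to the sparse ground truth, then report the
# position error at each control-point timestamp. File names are hypothetical.
import copy

import numpy as np
from evo.core import sync
from evo.tools import file_interface

traj_ref = file_interface.read_tum_trajectory_file("control_points.txt")
traj_est = file_interface.read_tum_trajectory_file("estimate.txt")

ref_sync, est_sync = sync.associate_trajectories(traj_ref, traj_est)
est_aligned = copy.deepcopy(est_sync)
est_aligned.align(ref_sync, correct_scale=False)  # least-squares SE(3) fit

errors = np.linalg.norm(est_aligned.positions_xyz - ref_sync.positions_xyz, axis=1)
for ts, err in zip(ref_sync.timestamps, errors):
    print(f"control point at t={ts:.2f}: {err:.3f} m")
```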

> **Second question**: Did you also consider 'aligning the trajectory origins' instead of Umeyama? In my opinion, using Umeyama implicitly puts more weight on RPE, which penalizes trajectories with good APE but lower rate.

We did think about that, but the problem is getting the orientation of the device in the laserscan (ground truth) frame. We were using marks on the ground where we had laserscan targets, but that only allows you to align your position, not your full pose. Since even minor orientation errors at the start lead to considerable position errors later in the run, we decided to go with Umeyama.
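A quick back-of-the-envelope check of that last point (the numbers are purely illustrative):

```python
# Back-of-the-envelope: a small heading error at the start of the run grows
# into a large position error further along the trajectory. Numbers are
# purely illustrative.
import math

heading_error_deg = 1.0   # hypothetical initial orientation error
distance_m = 50.0         # hypothetical distance travelled from the start

drift_m = distance_m * math.sin(math.radians(heading_error_deg))
print(f"~{drift_m:.2f} m of position error")  # ~0.87 m from a single degree
```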

> **Third question**: Would it be possible to get more insights on the APE and RPE of our trajectories against the ground truth? I ask this not as a competitor in the challenge, but out of interest in improving our solution by getting more performance stats.

What information would you be looking for? For APE maybe; RPE definitely not, since our ground truth has no orientation information.

> **Fourth question**: Will the server remain open for submissions after the challenge? Having the high-quality datasets and evaluation framework from the challenge available could be an enabler for SLAM development.

The plan is to keep the evaluation server open and update the leaderboard if better solutions are submitted.