ZikangZhou / HiVT

[CVPR 2022] HiVT: Hierarchical Vector Transformer for Multi-Agent Motion Prediction
https://openaccess.thecvf.com/content/CVPR2022/papers/Zhou_HiVT_Hierarchical_Vector_Transformer_for_Multi-Agent_Motion_Prediction_CVPR_2022_paper.pdf
Apache License 2.0

Prediction Results for non-agent objects #35

Closed · parthjdoshi closed this issue 1 year ago

parthjdoshi commented 1 year ago

Hello,

Thank you for the great work!

Since the model is geared towards predicting the future trajectories of multiple objects in the scene, I ran the pre-trained model to measure minADE and minFDE for all objects, rather than just the agent object type. On the Argoverse 1.1 validation set, with a batch size of 1, I got the following results:

    DATALOADER:0 VALIDATE RESULTS
    {'all_actors_minADE': 1.7695937156677246,
     'all_actors_minFDE': 3.879427433013916,
     'val_minADE': 0.6611008644104004,
     'val_minFDE': 0.9691500067710876,
     'val_minMR': 0.09206525981426239,
     'val_reg_loss': -0.30943363904953003}

As you can see, there's a large discrepancy between the agent-only metrics and the all_actors metrics.

One reason could be that pedestrians and bikes have behavioral patterns that differ from vehicles. Is there a way to mitigate this within the model, or should HiVT be considered a vehicle-centric model?
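For reference, by the all_actors metrics I mean the usual K-mode minADE/minFDE computed over every actor rather than just the focal agent. A minimal sketch, with my own function name and tensor shapes (not the repo's code):

    import torch

    def min_ade_fde(y_hat: torch.Tensor, y: torch.Tensor):
        """y_hat: [K, N, T, 2] predicted modes for N actors, y: [N, T, 2] ground truth."""
        err = torch.norm(y_hat - y.unsqueeze(0), dim=-1)      # [K, N, T] per-step displacement
        min_ade = err.mean(dim=-1).min(dim=0).values.mean()   # best mode per actor, averaged over actors
        min_fde = err[..., -1].min(dim=0).values.mean()       # final-step error of the best mode
        return min_ade, min_fde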

Carrotsniper commented 1 year ago

I am also interested in this topic. In the paper, are the reported numbers only for the "AGENT" type? I have found that there are 3 object types in the dataset.

parthjdoshi commented 1 year ago

Yes, the final validation results are only for the agent of interest. Please take a look at line 134 onwards in the HiVT model file. Link: https://github.com/ZikangZhou/HiVT/blob/main/models/hivt.py#L134
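For anyone else looking: the focal agent is selected out of all actors in the scene before the metrics are updated, so val_minADE/val_minFDE/val_minMR only cover that one agent per scenario. A simplified sketch of that selection, with my own names and shapes rather than the verbatim repo code (it assumes an agent_index field marking the focal agent's row in each scene):

    import torch

    def select_focal_agents(y_hat: torch.Tensor, y: torch.Tensor, agent_index: torch.Tensor):
        """Keep only the focal agent of each scene so the metrics are agent-only.
        y_hat: [K, N, T, F] modes for all N actors (first two channels are x/y),
        y: [N, T, 2] ground truth, agent_index: [B] focal-agent row per scene."""
        y_hat_agent = y_hat[:, agent_index, :, :2]   # [K, B, T, 2]
        y_agent = y[agent_index]                     # [B, T, 2]
        return y_hat_agent, y_agent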

ZikangZhou commented 1 year ago

Hi @parthjdoshi @Carrotsniper ,

How did you calculate the "all_actors" metrics? Have you masked the missing values? Missing values can severely affect the results (see my reply here). In Argoverse 1, the data are quite noisy, and the dataset does not annotate the object type. I would now recommend Argoverse 2 for the task of multi-agent prediction. Unfortunately, the benchmarks of both Argoverse 1 and Argoverse 2 currently evaluate only one agent per scene, so we have to report single-agent results to align with the numbers in other papers.

parthjdoshi commented 1 year ago

@ZikangZhou Thank you for the response.

I made the following changes to the HiVT model's validation step to compute the metrics over all the actors.

        # Scene-level metrics over all actors (not just the focal agent):
        # y_hat_best holds the best-mode prediction for every actor, data.y the ground-truth futures
        y_hat_best_all_actors = y_hat_best[:, :, :2]
        self.val_minADE.update(y_hat_best_all_actors, data.y)
        self.val_minFDE.update(y_hat_best_all_actors, data.y)
        self.log("all_actors_minADE", self.val_minADE, prog_bar=True, on_step=False, on_epoch=True, batch_size=num_tracks)
        self.log("all_actors_minFDE", self.val_minFDE, prog_bar=True, on_step=False, on_epoch=True, batch_size=num_tracks)

I have not masked the missing values. Any tips for how to do that? I did not think it was a significant issue, since it only affected approximately 3% of the scenarios for Argoverse 2.

A related query, since you mentioned using Argoverse 2: AV2 does not provide some of the information used within the model, such as traffic control and lane turn direction. Did you just pad those values with zeros, or is there some API call within AV2 to recover them?

ZikangZhou commented 1 year ago

In AV1, only the focal agent and the autonomous vehicle are guaranteed to have complete trajectories. All other agents' future trajectories are very noisy and have a large number of missing values. So if you want to calculate metrics for all agents in the scene correctly, you need to ignore those invalid time steps (for FDE and MR, it is best to skip agents that do not have valid data at the final time step). I have provided data['padding_mask'] in the preprocessed data (shape [N, 50], True where a time step is invalid), which can be used for masking.
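To make that concrete, here is a minimal sketch of masked minADE/minFDE using data['padding_mask']. The helper and its defaults are mine, not the repo's metric classes; it assumes the first 20 columns of the mask cover the observed steps and the remaining 30 the prediction horizon:

    import torch

    def masked_min_ade_fde(y_hat, y, padding_mask, num_hist: int = 20):
        """y_hat: [K, N, 30, 2] predicted modes, y: [N, 30, 2] ground truth,
        padding_mask: [N, 50] bool, True where a time step is invalid."""
        future_valid = ~padding_mask[:, num_hist:]               # [N, 30] valid future steps
        err = torch.norm(y_hat - y.unsqueeze(0), dim=-1)          # [K, N, 30]
        err = err * future_valid.unsqueeze(0)                     # zero out invalid steps
        steps = future_valid.sum(dim=-1).clamp(min=1)             # valid steps per actor
        min_ade = (err.sum(dim=-1) / steps).min(dim=0).values     # [N] best mode per actor
        has_any = future_valid.any(dim=-1)                        # actors with any valid future step
        has_final = future_valid[:, -1]                           # actors valid at the final step
        min_fde = err[..., -1].min(dim=0).values[has_final]
        return min_ade[has_any].mean(), min_fde.mean()

MR can be handled the same way: among the actors kept by has_final, count those whose best-mode final error exceeds the benchmark's 2 m threshold.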

The information in AV2 is slightly different, but the "semantic attributes" are flexible and depend on what information you have. Just set up the model architecture according to the attributes that are available.
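For example (my own sketch, not code from this repo), one way to keep the semantic attributes flexible is to embed whichever categorical lane attributes the dataset provides and sum them, so switching from AV1 to AV2 just means passing a different set of attributes:

    import torch
    import torch.nn as nn

    class LaneAttributeEmbedding(nn.Module):
        """Hypothetical example: embed the available categorical lane attributes
        and sum them into a single feature vector per lane segment."""
        def __init__(self, embed_dim: int, num_classes_per_attr: dict):
            super().__init__()
            # e.g. {'turn_direction': 3, 'traffic_control': 2, 'is_intersection': 2} for AV1,
            # or just {'is_intersection': 2} if that is all you want to use for AV2
            self.embeddings = nn.ModuleDict({
                name: nn.Embedding(num_classes, embed_dim)
                for name, num_classes in num_classes_per_attr.items()
            })

        def forward(self, attrs: dict) -> torch.Tensor:
            # attrs maps attribute name -> LongTensor of category indices per lane segment
            return sum(self.embeddings[name](idx) for name, idx in attrs.items())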

parthjdoshi commented 1 year ago

Thanks a ton, @ZikangZhou!