ZikangZhou / HiVT

[CVPR 2022] HiVT: Hierarchical Vector Transformer for Multi-Agent Motion Prediction
https://openaccess.thecvf.com/content/CVPR2022/papers/Zhou_HiVT_Hierarchical_Vector_Transformer_for_Multi-Agent_Motion_Prediction_CVPR_2022_paper.pdf
Apache License 2.0

Question: where do you make the scene centered at separate agents in code? #28

Closed alantes closed 1 year ago

alantes commented 1 year ago

Hello, Dr. Zhou! Thank you for sharing your elegant code. But I have a question:

It seems that you make the scene centered at the autonomous vehicle in the code:

    # make the scene centered at AV
    origin = torch.tensor([av_df[19]['X'], av_df[19]['Y']], dtype=torch.float)
    av_heading_vector = origin - torch.tensor([av_df[18]['X'], av_df[18]['Y']], dtype=torch.float)
    theta = torch.atan2(av_heading_vector[1], av_heading_vector[0])
    rotate_mat = torch.tensor([[torch.cos(theta), -torch.sin(theta)],
                               [torch.sin(theta), torch.cos(theta)]])

and express the coordinates of the other vehicles relative to that origin.

But according to your paper, shouldn't the scene be centered at each agent separately? Where is the code that corresponds to this part?

Thank you in advance!

ZikangZhou commented 1 year ago

Hi @alantes,

Thanks for your interest. The per-agent coordinate system transformation is conducted inside the forward pass, not in the data preprocessing part.

During data preprocessing, you can pick an arbitrary global coordinate system and the prediction results will be the same. I set the global coordinate system to be centered on the autonomous vehicle simply for convenience when running ablation studies.

In the model's forward pass, pay attention to the argument data['rotate_mat'] inside local_encoder.py and global_interactor.py. It is used to rotate the inputs in an agent-centric manner.
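
In code terms, the idea is roughly the following (a simplified sketch with made-up shapes, not the exact code in the repo):

    import math
    import torch

    # Simplified sketch (made-up shapes, not the repo's exact code): rotate each
    # agent's input vectors into that agent's own frame using a per-agent 2x2
    # rotation matrix, analogous to how data['rotate_mat'] is used.
    num_agents, num_steps = 4, 20
    rotate_angles = torch.rand(num_agents) * 2 * math.pi             # per-agent heading angle
    cos, sin = torch.cos(rotate_angles), torch.sin(rotate_angles)
    rotate_mat = torch.stack([torch.stack([cos, -sin], dim=-1),
                              torch.stack([sin, cos], dim=-1)], dim=-2)  # [N, 2, 2]

    displacements = torch.randn(num_agents, num_steps, 2)            # per-step motion vectors
    # Batched matrix multiply: each agent's vectors end up expressed in its own frame.
    local_inputs = torch.bmm(displacements, rotate_mat)              # [N, T, 2]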

alantes commented 1 year ago

Thank you for your quick response. I see that rotate_mat comes from rotate_angles, which is calculated in process_argoverse(), so each agent's neighbors are rotated according to that agent's own heading. Cool!
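
If I understand correctly, the angle computation is roughly like this (a rough sketch with assumed shapes, not the exact code in process_argoverse()):

    import torch

    # Rough sketch (assumed shapes): each agent's rotation angle comes from its own
    # heading over the last observed time step.
    positions = torch.randn(4, 20, 2)                       # [num_agents, T_obs, 2]
    heading_vectors = positions[:, 19] - positions[:, 18]   # motion over the last observed step
    rotate_angles = torch.atan2(heading_vectors[:, 1], heading_vectors[:, 0])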

May I ask where the translation part is handled? Is there a variable that determines how the coordinates of the other agents are shifted?

ZikangZhou commented 1 year ago

Translation invariance is achieved "automatically".

If you look at this line in the data processing, each time step's input token is represented as the difference between consecutive coordinates. If you add the same constant to both coordinates, their difference does not change.

On the other hand, in the attention layers, the relative position between agents is concatenated to the key and value embeddings. The relative position is also a coordinate difference, so it is translation-invariant as well.
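
Both points can be checked with toy tensors (a minimal sketch, not the repo's actual data structures):

    import torch

    # Minimal sketch: displacements and relative positions are coordinate
    # differences, so a constant shift of the whole scene cancels out.
    positions = torch.randn(4, 20, 2)             # [num_agents, T, 2] absolute coordinates
    shift = torch.tensor([100.0, -50.0])          # arbitrary global translation
    shifted = positions + shift

    displacements = positions[:, 1:] - positions[:, :-1]
    rel_positions = positions[:, None, -1] - positions[None, :, -1]  # pairwise offsets at the last step

    assert torch.allclose(displacements, shifted[:, 1:] - shifted[:, :-1], atol=1e-4)
    assert torch.allclose(rel_positions,
                          shifted[:, None, -1] - shifted[None, :, -1], atol=1e-4)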

alantes commented 1 year ago

Thank you for your patient reply. That is inspiring!