HKUST-Aerial-Robotics / SIMPL

SIMPL: A Simple and Efficient Multi-agent Motion Prediction Baseline for Autonomous Driving

about local reference frame and multi-agent #1

Closed ares89 closed 4 months ago

ares89 commented 4 months ago

First of all, congratulations on your remarkable work! Your paper presents an impressive and novel approach. I appreciate all the effort that went into this accomplishment.

With that said, I had a question regarding one aspect of your methodology. In the paper, you mentioned utilizing a local reference frame. I was wondering if you had explored the effects of not rotating the coordinate system and instead keeping the original global coordinates? If so, what were the results? Did it significantly impact performance or output in any way?

Any insights you could provide about your decision to use the local frame versus keeping global coordinates would be really valuable. I'm quite interested in understanding the trade-offs you may have considered.

Additionally, I'm curious whether you have evaluated your approach on the Argoverse 2 multi-agent tasks. If so, how did it perform in those scenarios? Multi-agent settings often introduce additional challenges, so I'd be interested to learn whether you encountered any specific complexities or considerations when applying your method in that context.

Thank you in advance for any perspective you can offer on this part of your approach. Once again, excellent work - I look forward to seeing what you accomplish next.

MasterIzumi commented 4 months ago

Thanks for your positive comments!

For the first question, if I understand correctly, you would like to know the performance difference between a scene-centric representation and the proposed method. Unlike agent-centric and scene-centric methods, we use an "instance-centric" representation: for each instance (actors and map elements), we normalize its coordinates w.r.t. a local frame. As stated in the paper (Sec. III-C), for actors we place the reference frame at the current observed state, while for static map elements, such as lane segments, we use the centroid of the polyline as the anchor point and the displacement vector between its endpoints as the heading. In addition, we use relative positional encoding (RPE) to describe the all-to-all spatial relationships among these instances.
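For concreteness, a minimal sketch of this per-instance normalization might look like the following (this is not the repository's actual code; the function names, the 2D pose convention, and the array shapes are my own assumptions):

```python
import numpy as np

def rot2d(theta):
    """2D rotation matrix R(theta)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def actor_local_frame(traj_xy, cur_heading):
    """Express an actor's observed trajectory in its own local frame,
    anchored at the current (last observed) state.

    traj_xy: (T, 2) global positions; cur_heading: yaw at the current state (rad).
    """
    anchor = traj_xy[-1]
    local = (traj_xy - anchor) @ rot2d(-cur_heading).T   # rotate global -> local
    return local, anchor, cur_heading

def lane_local_frame(polyline_xy):
    """Express a lane segment (polyline) in its own local frame.

    Anchor: centroid of the polyline.
    Heading: direction of the displacement vector between its endpoints.
    """
    anchor = polyline_xy.mean(axis=0)
    dx, dy = polyline_xy[-1] - polyline_xy[0]
    heading = np.arctan2(dy, dx)
    local = (polyline_xy - anchor) @ rot2d(-heading).T
    return local, anchor, heading
```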

Under this scene representation, the instance features can be regarded as nodes and the RPE as edges: the normalized local coordinates carry only each instance's own intrinsic information, while the relationships between instances are modeled by the RPE. With the proposed SFT, all instance features are updated in a symmetric manner, which generalizes better than scene-centric methods; compared to agent-centric methods, it also avoids redundant computation and is therefore more efficient.
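As a rough illustration of this node/edge view, the all-to-all RPE between the instance frames defined above could be computed along these lines (again a sketch with assumed names; the exact RPE parameterization used in the paper and code may differ, e.g. it may encode the relative heading via sin/cos):

```python
import numpy as np

def relative_pose_encoding(anchors, headings):
    """All-to-all relative poses between M instance frames.

    anchors: (M, 2) anchor points; headings: (M,) frame headings (rad).
    Returns rpe of shape (M, M, 3), where rpe[i, j] = (dx, dy, dyaw) is
    the pose of frame j expressed in frame i (the "edge" between nodes i and j).
    """
    M = len(anchors)
    rpe = np.zeros((M, M, 3))
    for i in range(M):
        R_inv = np.array([[np.cos(-headings[i]), -np.sin(-headings[i])],
                          [np.sin(-headings[i]),  np.cos(-headings[i])]])
        rpe[i, :, :2] = (anchors - anchors[i]) @ R_inv.T       # positions in frame i
        dyaw = headings - headings[i]
        rpe[i, :, 2] = np.arctan2(np.sin(dyaw), np.cos(dyaw))  # wrap to (-pi, pi]
    return rpe
```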

Note that this instance-centric representation is also adopted by recent methods such as HDGT, GoRela, QCNet, and MTR++. Personally, I find this formulation more efficient and elegant.

For the second question, we fully agree that scene-consistent (joint) multi-agent motion prediction is crucial for autonomous driving systems; however, we have not yet evaluated SIMPL in this setting. In my view, generating consistent scene-level predictions is a hard problem, and it is quite interesting to see recent advances in this field, such as FJMP, QCNeXt, and GameFormer. We will investigate this in the future :)

ares89 commented 4 months ago

Thank you for taking the time to address my questions. Coordinate rotation and multi-agent settings are important aspects to explore given their relevance in production environments, and avoiding unnecessary coordinate transformations can bring real computational savings.