[video-generation-ego-movement] Question about video generation

Good afternoon,

For video generation, the MagicDrive paper states that only the first and the last frames have bounding boxes (section 5.4). I have the following question:

How do you encode the movement of the car in the last frame, relative to its initial position in the first frame?

The ego pose of the car changes over these 7 frames (around 4 seconds in duration), but if I understand correctly, both the bounding boxes and the camera poses use the ego vehicle as their reference point. They therefore do not inject any information about how the ego pose changes from the car's starting position to its final one. To my mind, this is important information to inject into the video generation model, but I may be missing something.
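To make concrete what I mean by the change in ego pose, here is a minimal sketch. It is my own illustration, not taken from the MagicDrive code. It assumes each ego pose is available as a 2D `(x, y, yaw)` tuple in a shared global frame, with x as the forward axis of the ego frame; the function name `relative_ego_pose` is hypothetical.

```python
import math


def relative_ego_pose(first, last):
    """Return the last frame's ego pose expressed in the first frame's ego coordinates.

    Assumption: each pose is (x, y, yaw) in a shared global frame, yaw in radians.
    """
    x0, y0, yaw0 = first
    x1, y1, yaw1 = last

    # Displacement between the two ego positions in the global frame.
    dx, dy = x1 - x0, y1 - y0

    # Rotate the displacement by -yaw0 to express it in the first ego frame.
    c, s = math.cos(yaw0), math.sin(yaw0)
    rel_x = c * dx + s * dy
    rel_y = -s * dx + c * dy

    # Heading change, wrapped to [-pi, pi].
    dyaw = yaw1 - yaw0
    rel_yaw = math.atan2(math.sin(dyaw), math.cos(dyaw))

    return rel_x, rel_y, rel_yaw
```

For a car that drives 20 m straight ahead without turning, this yields a displacement of about 20 m along the first frame's forward axis and no heading change. This is the kind of quantity that I do not see encoded by ego-relative bounding boxes or camera poses.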

Thank you in advance for your feedback.

cure-lab / MagicDrive

[video-generation-ego-movement] Question about video generation #33