Thank you for your work. While reading the code alongside the paper, I noticed some differences between Figure 5 in the paper and the `MotionTransformerDecoder` class in the code.
In the paper, the four folds each receive a positional encoding and an MLP, and are then concatenated together and fed to the transformer.
But in the code, the four folds are split into `dynamic_query_embed` and `static_intention_embed`. How do these correspond? The `in_query_fuser` layer also does not seem to appear in the paper.
How can I map the code onto the paper's description of MotionFormer?
```python
dynamic_query_embed = self.dynamic_embed_fuser(torch.cat(
    [agent_level_embedding, scene_level_offset_embedding, scene_level_ego_embedding],
    dim=-1))  # the predicted goal point from the previous layer
# fuse static and dynamic intention embedding
query_embed_intention = self.static_dynamic_fuser(torch.cat(
    [static_intention_embed, dynamic_query_embed], dim=-1))  # (B, A, P, D)
# fuse intention embedding with query embedding
query_embed = self.in_query_fuser(torch.cat([query_embed, query_embed_intention], dim=-1))
```
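To make the question concrete, here is a minimal self-contained sketch of the three fusion steps quoted above. The tensor sizes `(B, A, P, D)` and the `Linear`/`ReLU` fuser definitions are my own assumptions for illustration only; they are not taken from the actual UniAD config.

```python
import torch
import torch.nn as nn

# Assumed dimensions: batch, agents, intention modes, embedding dim.
B, A, P, D = 2, 3, 6, 16

# Hypothetical fuser MLPs; the real fusers in the repo may be deeper/different.
dynamic_embed_fuser = nn.Sequential(nn.Linear(3 * D, D), nn.ReLU())
static_dynamic_fuser = nn.Sequential(nn.Linear(2 * D, D), nn.ReLU())
in_query_fuser = nn.Sequential(nn.Linear(2 * D, D), nn.ReLU())

# Dummy stand-ins for the query components in the snippet above.
agent_level_embedding = torch.randn(B, A, P, D)
scene_level_offset_embedding = torch.randn(B, A, P, D)
scene_level_ego_embedding = torch.randn(B, A, P, D)
static_intention_embed = torch.randn(B, A, P, D)
query_embed = torch.randn(B, A, P, D)

# Step 1: fuse the three goal-conditioned (dynamic) embeddings.
dynamic_query_embed = dynamic_embed_fuser(torch.cat(
    [agent_level_embedding, scene_level_offset_embedding,
     scene_level_ego_embedding], dim=-1))

# Step 2: fuse the static intention embedding with the dynamic one.
query_embed_intention = static_dynamic_fuser(torch.cat(
    [static_intention_embed, dynamic_query_embed], dim=-1))

# Step 3: fuse the intention embedding with the query embedding itself.
query_embed = in_query_fuser(torch.cat(
    [query_embed, query_embed_intention], dim=-1))

print(query_embed.shape)  # torch.Size([2, 3, 6, 16])
```

So rather than one big concatenation of all four folds before the transformer (as Figure 5 suggests), the code fuses them pairwise in three stages, which is the discrepancy I am asking about.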