The proper way to obtain the 2D ground truth is to use the camera intrinsic parameters to project the 3D camera coordinates to 2D pixel coordinates, as done here.
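For reference, that intrinsic projection could be sketched like this (a minimal pinhole-model sketch; the function name and the focal/principal-point values below are illustrative, not taken from the repo):

```python
import numpy as np

def project_to_pixels(points_cam, fx, fy, cx, cy):
    """Project 3D camera-frame points (N, 3) to 2D pixel coordinates (N, 2)
    with a pinhole model: u = fx * X / Z + cx, v = fy * Y / Z + cy."""
    X, Y, Z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    u = fx * X / Z + cx
    v = fy * Y / Z + cy
    return np.stack([u, v], axis=-1)

# Example: a point on the optical axis lands at the principal point
pts = np.array([[0.0, 0.0, 1.0], [1.0, -1.0, 2.0]])
uv = project_to_pixels(pts, fx=1000.0, fy=1000.0, cx=500.0, cy=500.0)
```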
For this project, though, we followed what MotionBERT did and simply took the (X, Y) from the (X, Y, Z) as the 2D ground truth (you can see it here).
The reason this is acceptable is that MotionAGFormer takes a normalized 2D pose sequence in the range [-1, 1] as input. The 3D pose sequence is already normalized (and rescaled to the same scale as the 2D pose, as I explained here), so dropping the Z coordinate is equivalent to projecting to 2D in the standard way and then normalizing.
For params I used the function implemented here. For MACs/frame you can use a library such as torchprofile, which computes the MACs for the whole model; then simply divide that number by the number of output frames to get MACs/frame.
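As a rough sketch of both counts (assuming PyTorch; the tiny `nn.Linear` below stands in for the real pose model, and `n_frames` is an illustrative clip length, not the repo's setting):

```python
from torch import nn

def count_parameters(model):
    # Sum of all trainable parameter tensors
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Hypothetical stand-in for the real model: 17 joints, 2D in, 3D out
model = nn.Linear(17 * 2, 17 * 3)
n_frames = 243  # illustrative number of output frames

params = count_parameters(model)

# torchprofile reports total MACs over the whole input clip, so dividing
# by the number of output frames yields MACs/frame (uncomment if installed):
# import torch
# from torchprofile import profile_macs
# macs = profile_macs(model, torch.randn(1, n_frames, 17 * 2))
# macs_per_frame = macs / n_frames
```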
Thank you for your great work! I have some questions about training; could you please help? Thanks!