For the PIE Dataset the input is a tensor of the form:
[Batch Size] [Dec Size = 15] [ 4 ]
And the output goal tensor is of the form:
[ Batch Size ] [ 45 ?? ] [ Dec Size = 15 ] [ 4 ]
I have the following questions:
In the input, [ 0 ] [ 0 : 14 ] [ 4 ] would correspond to the 14 frames leading up to the current frame, correct?
In the output goal tensor, where does the 4th dimension come from? If we are predicting 45 frames into the future, why is it not of size [ Batch Size ] [ 45 ?? ] [ 4 ] ?
Yes, you are correct. The input tensor has shape [Batch, Enc_steps=15, 4].
The output goal tensor has shape [Batch, Enc_steps=15, Dec_steps=45, 4]. Because we embed goals into each encoder cell, there are Enc_steps=15 sets of stepwise goals, one per encoder step. The predicted trajectory has the same shape, but we only evaluate the prediction from the last encoder step, which corresponds to the 45 frames into the future. Note that we feed the goals to each encoder cell as features ([Batch, Dec_steps=45, 128]) rather than as regressed locations.
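To make the shapes concrete, here is a minimal NumPy sketch. The array names (`inputs`, `pred_traj`, `final_pred`) are hypothetical and the values are random placeholders, not model output. Only the sizes 15, 45 and 4 come from the discussion above.

```python
import numpy as np

# Sizes from the discussion; the batch size is arbitrary.
batch, enc_steps, dec_steps = 2, 15, 45

# Observed boxes: one 4-d box per observed frame.
inputs = np.zeros((batch, enc_steps, 4))

# Placeholder for the model output: a 45-step prediction
# is produced at every one of the 15 encoder steps.
rng = np.random.default_rng(0)
pred_traj = rng.random((batch, enc_steps, dec_steps, 4))

# Only the prediction made at the last encoder step is evaluated.
final_pred = pred_traj[:, -1]

print(inputs.shape, pred_traj.shape, final_pred.shape)
```

Indexing with `[:, -1]` drops the encoder dimension, leaving a [Batch, 45, 4] tensor, which is the shape the question expected.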
The regressed goal and trajectory tensors are both in normalized cxcywh form, but we convert them to x1y1x2y2 for evaluation.
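The box format conversion could look roughly like the sketch below. The helper name `cxcywh_to_xyxy` is hypothetical and is not taken from the repository. It only changes the box format; undoing the normalization is left out because the normalization scheme is not stated here.

```python
import numpy as np

def cxcywh_to_xyxy(boxes):
    """Convert boxes of shape [..., 4] from (cx, cy, w, h) to (x1, y1, x2, y2)."""
    boxes = np.asarray(boxes, dtype=float)
    # Move the box axis first so the four components can be unpacked.
    cx, cy, w, h = np.moveaxis(boxes, -1, 0)
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=-1)

# Works on a single box or on a whole [Batch, 45, 4] trajectory tensor.
single = cxcywh_to_xyxy([0.5, 0.5, 0.2, 0.4])
batched = cxcywh_to_xyxy(np.zeros((2, 45, 4)))
```

Because the function only touches the last axis, it can be applied to the goal and trajectory tensors without reshaping.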