Open Lucifer-G0 opened 1 week ago
In both cases we have a window of 64 frames. Padding is added later: if the padding is 7, 7 elements are added at the beginning and 7 at the end. It does not affect the 64-frame window size.
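As a quick sanity check on the padding point, here is a minimal sketch of the output-length arithmetic for a temporal convolution, assuming stride 1 and no dilation. The kernel size of 15 and padding of 7 are the values quoted in the question; the function name `conv1d_output_length` is hypothetical and not taken from the repository.

```python
def conv1d_output_length(num_frames: int, kernel_size: int, padding: int) -> int:
    """Output length of a 1D convolution with symmetric padding.

    Assumes stride 1 and dilation 1 (an assumption for this sketch).
    """
    padded_length = num_frames + 2 * padding  # padding is added on both ends
    return padded_length - kernel_size + 1


# Values from the discussion: 64-frame window, kernel 15, padding 7.
same_length = conv1d_output_length(64, kernel_size=15, padding=7)

# Without padding, the same kernel would shrink the temporal dimension.
shrunk_length = conv1d_output_length(64, kernel_size=15, padding=0)

print(same_length, shrunk_length)
```

Under these assumptions the padding only keeps the temporal length at 64 after the convolution; it does not change how many frames are in the input window.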
If I remember correctly, the way it works is the following:
We construct the 64-frame window of sparse poses. In ours-0, frames 1 to 63 hold the previous sparse poses and frame 64 holds the current frame's sparse pose; we then read the frame 64 output as the predicted pose. In ours-7, frames 1 to 56 hold the previous sparse poses, frame 57 holds the frame we want to process, and frames 58 to 64 hold the "future" sparse poses; in this case we read frame 57 of the output to generate the pose. Conceptually, this is equivalent to running our method with a 7-frame delay with respect to the application.
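The slot layout described above can be sketched as follows. This is only an illustration of the indexing, using the 1-based frame positions from the explanation; `window_layout` and its parameters are hypothetical names, not code from the repository.

```python
WINDOW_SIZE = 64  # window length stated in the discussion


def window_layout(future_frames: int, window_size: int = WINDOW_SIZE):
    """Return (past_slots, read_slot, future_slots) as 1-based positions.

    future_frames = 0 corresponds to "ours-0", future_frames = 7 to "ours-7".
    """
    if not 0 <= future_frames < window_size:
        raise ValueError("future_frames must be in [0, window_size)")
    # The frame being processed sits just before the future frames.
    read_slot = window_size - future_frames
    past_slots = list(range(1, read_slot))
    future_slots = list(range(read_slot + 1, window_size + 1))
    return past_slots, read_slot, future_slots


for name, future in (("ours-0", 0), ("ours-7", 7)):
    past, read, fut = window_layout(future)
    print(name, "past:", len(past), "read:", read, "future:", len(fut))
```

In both configurations the three groups together always cover exactly 64 positions; only the position that is read from the output moves.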
Hope this clarifies your question. Let me know if you need further detail!
I have some confusion regarding the future frames mentioned in your paper. How are "ours-0" and "ours-7" defined? The convolution across the temporal dimension in the code uses a padding of seven frames at the front and back, with a window of 15. Does this imply that it uses 7 future frames? In the experimental section, you mention a window of 64 frames with 7 future frames. What does that mean? Does it refer to a convolution window of 64 frames?