Closed jenhungh closed 2 years ago
Hi, thank you for your interest in our work. The inputs (images) to the model have shape (B, N, 3, H, W): each batch sample contains N images of resolution (H, W), one per camera. So with data shuffling only the samples are shuffled; the images within each sample are unchanged.
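To make the shuffling point concrete, here is a minimal sketch in plain Python. It is not the repository's actual data loader: `load_sample` and `shuffled_batches` are hypothetical names, and each image is replaced by a `(timestep, camera)` tag instead of a (3, H, W) tensor so the grouping is easy to inspect. It assumes 6 cameras, as in the question.

```python
import random

NUM_CAMERAS = 6  # N in (B, N, 3, H, W); six cameras, as in the question


def load_sample(timestep):
    """Hypothetical loader: one entry per camera for a single timestep.

    Real code would return an (N, 3, H, W) tensor; here each image is
    replaced by a (timestep, camera) tag.
    """
    return [(timestep, cam) for cam in range(NUM_CAMERAS)]


def shuffled_batches(num_timesteps, batch_size, seed=0):
    """Shuffle sample indices only, then group them into batches."""
    order = list(range(num_timesteps))
    random.Random(seed).shuffle(order)  # reorders whole samples, not images
    for start in range(0, len(order), batch_size):
        yield [load_sample(t) for t in order[start:start + batch_size]]


batches = list(shuffled_batches(num_timesteps=8, batch_size=4))
for batch in batches:
    for sample in batch:
        # All images in a sample share one timestep: the cameras stay in sync.
        assert len({t for t, _ in sample}) == 1
        # Camera order inside a sample is untouched by the shuffle.
        assert [cam for _, cam in sample] == list(range(NUM_CAMERAS))
```

Each batch here is a list of B samples, each holding N camera entries, which mirrors the (B, N, ...) leading dimensions described above. Because the shuffle acts on sample indices, the N images of a sample always travel together.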
After reading the FSM paper and looking at the code, I am still a bit confused about the input shape for the FSM model. We need to input 6 synchronized images in order to compute the spatio-temporal loss. So should the input shape be (B, 6, 3, H, W) or (B, 3, H, W)? If it is (B, 3, H, W), the batch size would have to be 6, but then how can we make sure the images stay synchronized under data shuffling? Thanks.