Closed bo-miao closed 2 years ago
Please accept my apologies for the late response.
We cannot simulate event data for the first frame, which is similar to the situation with the optical flow modality. This is not an issue, however, because the main objective is action recognition and we always have more than one frame. Moreover, in a continuous stream of data this happens only on the very first frame of a video, not for each sample.
More information about the number of aggregated frames can be found in the supplementary material. However, you are correct: for each original pair of frames we obtain one channel of the event representation. Since we use three channels for each voxel-representation frame in our experiments, each such frame corresponds to six original frames.
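To make the aggregation concrete, here is a rough sketch of binning an event stream `(t, x, y, polarity)` into a voxel grid with a fixed number of temporal channels, using bilinear weighting in time. The function name, array layout, and normalization are assumptions for illustration, not the authors' actual code:

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Aggregate events of shape (N, 4) with columns (t, x, y, polarity)
    into a (num_bins, height, width) voxel grid.
    Sketch only: timestamps are normalized to [0, num_bins - 1] and each
    event's polarity is split bilinearly between its two nearest bins."""
    voxel = np.zeros((num_bins, height, width), dtype=np.float32)
    if len(events) == 0:
        return voxel
    t = events[:, 0]
    x = events[:, 1].astype(int)
    y = events[:, 2].astype(int)
    p = events[:, 3].astype(np.float32)
    # Normalize timestamps so the stream spans the bin axis.
    t_norm = (t - t[0]) / max(t[-1] - t[0], 1e-9) * (num_bins - 1)
    left = np.floor(t_norm).astype(int)
    w_right = (t_norm - left).astype(np.float32)
    right = np.clip(left + 1, 0, num_bins - 1)
    # Unbuffered accumulation handles repeated (bin, y, x) indices correctly.
    np.add.at(voxel, (left, y, x), p * (1.0 - w_right))
    np.add.at(voxel, (right, y, x), p * w_right)
    return voxel
```

Because the two bilinear weights sum to one, each event contributes exactly its polarity to the grid, so the total mass of the voxel equals the sum of polarities regardless of the number of bins.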
Thanks for your answer!
Hi,
Thank you for the interesting work.
ESIM generates events between a pair of frames. In that case, how did you generate the event representation for the first frame of each video, since it has no past frame information? In addition, do you aggregate all events between a pair of frames via a Voxel Grid, so that the number of event representations matches the number of frames before interpolation?