Dear contributors,

I am currently working on IMU inference. I apply group normalization to ACC and GYRO separately (following IMU2CLIP) and have run into the problems below:
A 5-second or 10-second clip is too long: it can cover more than 2 narrations, which makes the matching ambiguous. I switched to a 2-second clip and pad it to 10 seconds, but I am not sure whether this is correct.
From the paper, it is unclear whether the IMU embedding corresponds to a "summary of the full video" or to "one sentence of the narration".
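
A minimal sketch of the preprocessing described above, to make the question concrete. It is not the IMU2CLIP code. Everything specific here is an assumption: the 200 Hz sampling rate, the `(6, T)` layout with ACC in rows 0-2 and GYRO in rows 3-5, zero mean and unit variance per sensor group as the meaning of "group normalization", and zero-padding on the right. The names `group_normalize` and `crop_and_pad` are hypothetical.

```python
import numpy as np

SAMPLE_RATE_HZ = 200  # assumed sampling rate, not stated in this post
TARGET_SECONDS = 10   # padded clip length mentioned above


def group_normalize(imu: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize ACC (rows 0-2) and GYRO (rows 3-5) as two separate groups.

    Assumption: each group is shifted to zero mean and scaled to unit variance.
    """
    out = imu.astype(np.float64, copy=True)
    for rows in (slice(0, 3), slice(3, 6)):
        group = out[rows]
        out[rows] = (group - group.mean()) / (group.std() + eps)
    return out


def crop_and_pad(imu: np.ndarray, center_s: float, clip_s: float = 2.0) -> np.ndarray:
    """Cut a clip_s-second window around center_s, then zero-pad on the right."""
    n_clip = int(round(clip_s * SAMPLE_RATE_HZ))
    n_target = TARGET_SECONDS * SAMPLE_RATE_HZ
    start = int(round((center_s - clip_s / 2) * SAMPLE_RATE_HZ))
    # Clamp the window so it stays inside the recording.
    start = max(0, min(start, imu.shape[1] - n_clip))
    clip = imu[:, start:start + n_clip]
    padded = np.zeros((imu.shape[0], n_target), dtype=clip.dtype)
    padded[:, :clip.shape[1]] = clip
    return padded


# Usage with synthetic data: 20 seconds of fake 6-axis IMU.
rng = np.random.default_rng(0)
raw = rng.normal(size=(6, 20 * SAMPLE_RATE_HZ))
normalized = group_normalize(raw)
model_input = crop_and_pad(normalized, center_s=5.0)
print(model_input.shape)
```

With this layout, 80% of the padded input is zeros, so whether the model was trained on inputs that look like this is exactly the open question.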
Since the IMU performance is relatively low and the raw signal is not human-readable, it is hard for me to verify whether my setup is correct.
Thank you in advance for your help!
I attach an image below, where each subplot shows the 2-second clip for one narration. You can see that 1 & 2 are identical, and 3 & 4 are identical too. The reason is that the narrations are too close to each other in time.
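
One way to check how much neighbouring narration windows overlap is sketched below. It assumes each narration has a single timestamp in seconds and that the clip is centered on it. The timestamps in the example are made up for illustration, and the names `overlap_fraction` and `near_duplicate_pairs` and the 0.9 threshold are hypothetical.

```python
def overlap_fraction(t_a: float, t_b: float, clip_s: float = 2.0) -> float:
    """Fraction of a clip_s-second window shared by windows centered at t_a and t_b."""
    return max(0.0, clip_s - abs(t_a - t_b)) / clip_s


def near_duplicate_pairs(timestamps_s, clip_s: float = 2.0, threshold: float = 0.9):
    """Return index pairs whose windows overlap by at least `threshold`."""
    pairs = []
    for i in range(len(timestamps_s)):
        for j in range(i + 1, len(timestamps_s)):
            if overlap_fraction(timestamps_s[i], timestamps_s[j], clip_s) >= threshold:
                pairs.append((i, j))
    return pairs


# Made-up narration timestamps (seconds); the first two and last two are very close.
timestamps = [10.0, 10.1, 14.0, 14.05]
print(near_duplicate_pairs(timestamps))
```

Pairs flagged this way feed almost the same IMU samples to the model, so their embeddings cannot distinguish the two narrations regardless of how the clip is padded.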