facebookresearch / ImageBind

ImageBind One Embedding Space to Bind Them All

IMU Input Dimensions are Unclear - Missing Information on Data Prep #66

Open steppelord opened 1 year ago

steppelord commented 1 year ago

Hello,

What is the required format for IMU inputs? Or rather, why does T have to be 2000? I've tried to run the code using sample inputs formatted as specified in the appendix of the paper.

For IMU we use a 6×T tensor to represent the sequence of IMU sensor readings over time.

Initially I tried to use the sample from the Ego-4D dataset: https://ego4d-data.org/docs/data/imu/

but this kept throwing size mismatch errors.

I am trying to create a joint embedding for a single IMU clip.

Does this mean the model requires a minimum of 2000 time steps for IMU sensors?
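For reference, this is roughly what I am running (a minimal sketch; the [1, 6, 2000] shape is simply what I found avoids the error, and the import paths follow the repo README):

```python
import torch
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)

# Dummy IMU clip: 6 channels (accel x/y/z + gyro x/y/z) over T time steps.
# Only T = 2000 seems to get through without a size mismatch.
imu = torch.randn(1, 6, 2000, device=device)

with torch.no_grad():
    embeddings = model({ModalityType.IMU: imu})

print(embeddings[ModalityType.IMU].shape)  # e.g. torch.Size([1, 1024])
```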

Thank you for your help

aelimame commented 1 year ago

Based on that sample from the Ego-4D dataset (https://ego4d-data.org/docs/data/imu/), the sample rate is 200 Hz (5 ms per time step). If only T=2000 works, does that mean they expect each clip to correspond to a 10-second video segment?
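Just to spell the arithmetic out (trivial, but it makes the 2x gap obvious):

```python
SAMPLE_RATE_HZ = 200        # Ego-4D IMU: ~200 Hz, i.e. 5 ms per time step
print(SAMPLE_RATE_HZ * 10)  # 2000 -> the 6 x 2000 input implies a 10 s window
print(SAMPLE_RATE_HZ * 5)   # 1000 -> the 5 s clips from the paper, hence the 2x gap
```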

However they mention this in the paper:

For each video, We select all time-stamps that contains a synchronized IMU signal as well as aligned narrations. We sample 5 second clips around each time-stamp.

So, there seems to be some 2x ratio lost somewhere?

artemisp commented 1 year ago

It seems that we are supposed to use repeated padding?

PadIm2Video(pad_type="repeat", ntimes=2)

aelimame commented 1 year ago

It seems that we are supposed to use repeated padding?

PadIm2Video(pad_type="repeat", ntimes=2)

But that's for the image-to-video transformation (the forward() method). It converts a single image into a video of n time steps, either by repeating the same image (pad_type="repeat") or by padding with zero/black frames (pad_type="zero").

So not related to the IMU processing really.
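For context, this is roughly what that transform does to an image batch (a conceptual sketch in plain PyTorch, not the actual implementation):

```python
import torch

def pad_im2video(x: torch.Tensor, ntimes: int = 2, pad_type: str = "repeat") -> torch.Tensor:
    """Turn a batch of images (B, C, H, W) into short clips (B, C, T, H, W)."""
    x = x.unsqueeze(2)  # add a time axis: (B, C, 1, H, W)
    if pad_type == "repeat":
        # copy the same frame ntimes along the time axis
        return x.repeat(1, 1, ntimes, 1, 1)
    if pad_type == "zero":
        # keep the real frame first, then append black frames
        zeros = torch.zeros_like(x).repeat(1, 1, ntimes - 1, 1, 1)
        return torch.cat([x, zeros], dim=2)
    raise ValueError(pad_type)

imgs = torch.randn(4, 3, 224, 224)
print(pad_im2video(imgs).shape)  # torch.Size([4, 3, 2, 224, 224])
```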

artemisp commented 1 year ago

I agree - I am just making the conjecture that, since we want image-IMU alignment for training, and this is the procedure for image padding, it could also work for IMU padding to maintain the alignment - even though it is nowhere to be found in the code/paper. It is worth a try. Another option would be to sample 10 s clips - but that seems to directly contradict the paper.

Grabbing a 10 s video clip and aligning it with the 5 s IMU could make sense, given that there may be a small 1-2 s misalignment between the IMU and video streams due to various factors (e.g. latency).

Now....this is all a guess! I tried this method for action recognition (see IMU2CLIP paper) and it seemed to work decently. However, I cannot say for sure if it is the right way to go.
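To make the conjecture concrete, here is a rough sketch of the two padding options I mean (the function name and shapes are purely illustrative, not taken from the ImageBind code):

```python
import torch
import torch.nn.functional as F

def pad_imu_clip(imu: torch.Tensor, target_t: int = 2000, mode: str = "repeat") -> torch.Tensor:
    """Pad a (6, T) IMU clip along time to target_t.

    mode="repeat": tile the clip until it reaches target_t (the conjecture above).
    mode="zero":   right-pad with zeros instead.
    """
    t = imu.shape[-1]
    if t >= target_t:
        return imu[..., :target_t]
    if mode == "repeat":
        reps = -(-target_t // t)  # ceiling division
        return imu.repeat(1, reps)[..., :target_t]
    if mode == "zero":
        return F.pad(imu, (0, target_t - t))
    raise ValueError(mode)

clip_5s = torch.randn(6, 1000)  # 5 s at 200 Hz
print(pad_imu_clip(clip_5s, mode="repeat").shape)  # torch.Size([6, 2000])
```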

aelimame commented 1 year ago

I agree - I am just making the conjecture that, since we want image-IMU alignment for training, and this is the procedure for image padding, it could also work for IMU padding to maintain the alignment - even though it is nowhere to be found in the code/paper. It is worth a try. Another option would be to sample 10 s clips - but that seems to directly contradict the paper.

Grabbing a 10 s video clip and aligning it with the 5 s IMU could make sense, given that there may be a small 1-2 s misalignment between the IMU and video streams due to various factors (e.g. latency).

Now....this is all a guess! I tried this method for action recognition (see IMU2CLIP paper) and it seemed to work decently. However, I cannot say for sure if it is the right way to go.

Yeah sure, this is all hypothesis waiting for the FAIR guys to validate...

Thanks for sharing that paper, it looks interesting. Do they also provide source code?

artemisp commented 1 year ago

Oh yes of course, I did not mean for this to be a final answer - just trying to help out/start a discussion since it has been a while without a response 🥲.

Yes, they do provide source code, but once again the input length is 1000 time steps, corresponding to 5-second clips.

For my use case I tried the following to account for the 2x factor: padding with zeros, grabbing 10-second clips, and the "repeat" method; the repeat method seemed to work best. I hope this helps get your application moving.

aelnouby commented 1 year ago

Hi everyone,

Thanks for your question and sorry for the late response. The IMU signal corresponds to 10-second clips; this is a typo in the appendix that will be fixed in the coming revision of the paper. For the aligned video, we sample 2 frames at the center of the window.
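Illustratively, that corresponds to something like the following sketch (the 200 Hz sample rate and the exact indexing here are assumptions made for the example, not a description of the actual training pipeline):

```python
import numpy as np

SAMPLE_RATE_HZ = 200   # assumed Ego-4D IMU rate
WINDOW_S = 10          # per the answer above: the IMU input covers 10 s -> 6 x 2000
T = SAMPLE_RATE_HZ * WINDOW_S

def imu_window(imu: np.ndarray, imu_ts: np.ndarray, center_ts: float) -> np.ndarray:
    """Cut a 10 s window (6, 2000) centred on a timestamp.

    imu: (6, N) raw readings; imu_ts: (N,) timestamps in seconds.
    """
    mask = (imu_ts >= center_ts - WINDOW_S / 2) & (imu_ts < center_ts + WINDOW_S / 2)
    window = imu[:, mask]
    # naive fix-up to exactly 2000 samples (a real pipeline might resample instead)
    if window.shape[1] < T:
        window = np.pad(window, ((0, 0), (0, T - window.shape[1])))
    return window[:, :T]

def center_frames(n_frames: int) -> list:
    """Pick 2 frames at the centre of the window, as described above."""
    mid = n_frames // 2
    return [mid - 1, mid]
```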

beitong95 commented 1 year ago

Hi, I was wondering what normalization method is used on the IMU data in ImageBind. It seems the data from Ego4D is raw IMU data; however, in Figure 7 the IMU data appears to be clipped to [-1, 1].

zainhas commented 1 year ago

@beitong95 Good point. Another issue with the preprocessing is that it doesn't work for any input longer or shorter than 2000 points - in my current implementation I've just padded up to 2k, or truncated and taken only the first 2k data points, to generate embeddings. It would be good to know the details of how the model was trained so that the embeddings are more reliable!
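For what it's worth, my stopgap looks roughly like this (purely a workaround on my side, not the official preprocessing):

```python
import torch
import torch.nn.functional as F

def to_fixed_length(imu: torch.Tensor, target_t: int = 2000) -> torch.Tensor:
    """Force a (6, T) IMU tensor to (6, 2000): zero-pad short clips,
    keep only the first 2000 samples of long ones."""
    t = imu.shape[-1]
    if t < target_t:
        return F.pad(imu, (0, target_t - t))
    return imu[..., :target_t]

print(to_fixed_length(torch.randn(6, 1500)).shape)  # torch.Size([6, 2000])
print(to_fixed_length(torch.randn(6, 4000)).shape)  # torch.Size([6, 2000])
```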

RitvikKapila commented 1 year ago

Hi, I had a question similar to @beitong95's: how is the IMU input preprocessed and/or normalized before being fed to the model? Is there a load_and_transform function provided for IMU? Thanks.