facebookresearch / r3m

Pre-training Reusable Representations for Robotic Manipulation Using Diverse Human Video Data
https://sites.google.com/view/robot-r3m/
MIT License

Ego4d preprocessing steps #13

Closed · liruiw closed this 2 years ago

liruiw commented 2 years ago

First, congratulations on your paper and thanks for sharing the codebase! I have recently been studying some of the implementation details and trying to reproduce the R3M training on Ego4D. I am wondering whether there are preprocessing steps for the Ego4D dataset that have not been released. For instance, the paper suggests splitting the videos into individual frames (are the videos used a subset of the full_scale Ego4D?), and the codebase expects 'len' and 'txt' keys in the manifest file (these fields seem to be missing). It would be great if you could hint at how to preprocess the dataset or point me to related code that I might have missed. Thanks!

suraj-nair-1 commented 2 years ago

Hello, thanks for your interest in the work! Yes, because of some differences in format between the old internal and the current external versions of the dataset, the pre-processing steps for the public dataset will be different. The steps are:

liruiw commented 2 years ago

Thanks for your reply! Just to confirm: by "annotation text", do you mean the 'summary_text' in narration.json for each video?

suraj-nair-1 commented 2 years ago

Hi, I think you may actually want 'narration_text'. It should be just a single sentence or so of what is happening in the short clip like "C is chopping the tomatoes" or "C wiping the window with the rag".

liruiw commented 2 years ago

I see. I think we are referring to the same text, but just named differently in different versions. Thanks!
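For concreteness, a minimal sketch of what the last preprocessing step could look like (not the authors' code): it writes a manifest with the 'len' and 'txt' fields mentioned in the first post, assuming each clip's frames have already been extracted into their own folder; the 'path' column name and the CSV layout are assumptions.

```python
# Hypothetical manifest builder, not the official R3M preprocessing.
# Assumes <frames_root>/<clip_id>/*.jpg already exist and that `captions`
# maps clip_id -> its narration_text.
import csv
import os

def write_manifest(frames_root, captions, manifest_path="manifest.csv"):
    rows = []
    for clip_id, txt in captions.items():
        frame_dir = os.path.join(frames_root, clip_id)
        if not os.path.isdir(frame_dir):
            continue
        n_frames = len([f for f in os.listdir(frame_dir) if f.endswith(".jpg")])
        # 'len' = number of frames in the clip, 'txt' = its narration text
        rows.append({"path": frame_dir, "len": n_frames, "txt": txt})
    with open(manifest_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["path", "len", "txt"])
        writer.writeheader()
        writer.writerows(rows)
```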

taokong commented 2 years ago
  • Use the public Ego4D CLI/tools to download only the language annotated clips from Ego4D (the canonical clips).

Hi @suraj-nair-1 @liruiw, could you explain how to download only the language annotated clips? Thanks in advance.

liruiw commented 2 years ago

I think I downloaded the full-scale version from the official website with language annotations.

jasonseu commented 2 years ago

Hi, I think you may actually want 'narration_text'. It should be just a single sentence or so of what is happening in the short clip like "C is chopping the tomatoes" or "C wiping the window with the rag".

Hi, each 'narration_text' in narration.json for a video is associated only with a 'timestamp_frame' and a 'timestamp_sec'. Without an 'end_frame' or 'end_sec', how can we get a video clip to pair with the 'narration_text'?

suraj-nair-1 commented 2 years ago

If you are using the canonical clips, each clip should be a short video, and the text should refer to the entire clip.

jasonseu commented 2 years ago

If you are using the canonical clips, each clip should be a short video, and the text should refer to the entire clip.

I see. Do you mean downloading clips using ego4d --output_directory="~/ego4d_data" --datasets clips annotations? That way, I got 12283 video clips. However, the file narration.json contains the annotations for the full_scale videos (screenshot of its contents omitted). I am still confused about how to associate each 'narration_text' in narration.json with the corresponding video clip.
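One quick way to inspect narration.json locally (standing in for the screenshot above); the path here assumes the ego4d CLI's default v1 layout and may differ on your machine:

```python
# Peek at narration.json (in place of the screenshot above).
# The path assumes the ego4d CLI's default layout; adjust as needed.
import json

with open("ego4d_data/v1/annotations/narration.json") as f:
    narrations = json.load(f)

video_uid = next(iter(narrations))  # an arbitrary full_scale video uid
print(video_uid)
print(json.dumps(narrations[video_uid], indent=2)[:1500])  # first part of its fields
```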

MathisClautrier commented 2 years ago

I had the same problems when trying to download the clips directly. I chose to download the full_scale videos and extract the clips myself. You can merge the captions (narration.json) by timestamp_sec (there can be multiple captions for a single clip) and then extract the clips sequentially. Doing this, though, you will end up with multiple captions for some clips. From a personal inspection of the resulting clips, this process seems satisfactory and yields many clips (I chose not to extract clips that were too long or too short).
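A rough sketch of this approach (not the official R3M pipeline): treat consecutive narration timestamps as clip boundaries, skip segments that are too short or too long, and cut them with ffmpeg. The simplified narration layout and the length bounds below are assumptions to adapt to the actual file.

```python
# Sketch of the clip-extraction idea above; not the official R3M pipeline.
# Assumes `narrations` is a list of dicts with 'timestamp_sec' and
# 'narration_text' for one full_scale video, and that ffmpeg is on PATH.
import subprocess

MIN_SEC, MAX_SEC = 2.0, 30.0  # illustrative bounds for "too short" / "too long"

def extract_clips(narrations, video_path, out_prefix):
    narrations = sorted(narrations, key=lambda n: n["timestamp_sec"])
    pairs = []
    for i in range(len(narrations) - 1):
        start = narrations[i]["timestamp_sec"]
        duration = narrations[i + 1]["timestamp_sec"] - start
        if not (MIN_SEC <= duration <= MAX_SEC):
            continue
        out_file = f"{out_prefix}_{i:05d}.mp4"
        # Stream copy is fast but snaps to keyframes; re-encode for frame-accurate cuts.
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(start), "-i", video_path,
             "-t", str(duration), "-c", "copy", out_file],
            check=True,
        )
        pairs.append((out_file, narrations[i]["narration_text"]))
    return pairs  # (clip_path, caption) pairs
```

Each clip here is paired with the narration at its start timestamp; narrations that fall inside the same window would have to be merged into multiple captions for that clip, as described above.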

jasonseu commented 2 years ago

I had the same problems when trying to download the clips directly. I chose to download the full_scale videos and extract the clips myself. You can merge the captions (narration.json) by timestamp_sec (there can be multiple captions for a single clip) and then extract the clips sequentially. Doing this, though, you will end up with multiple captions for some clips. From a personal inspection of the resulting clips, this process seems satisfactory and yields many clips (I chose not to extract clips that were too long or too short).

Thanks for your reply. If I understand correctly, the clips yielded by your approach may correspond to multiple text descriptions. This leads to a coarser match between language and video, which may hinder the learning of video-language alignment in the R3M model.

MathisClautrier commented 2 years ago

Only a fraction of the clips will have multiple text descriptions (when the actor is doing multiple things simultaneously), and you can choose to leave those out. I don't know whether including them would hinder learning the video-language alignment, since a given video may genuinely correspond to several actions; I haven't tried using such clips, so I can't say whether it affects learning.

suraj-nair-1 commented 2 years ago

Thanks @MathisClautrier for clarifying this. I haven't used the ego4d CLI, as the internal version of the dataset (and the way it was loaded) was different.

Also, I used a single text description per clip. However, if a clip does match multiple text descriptions, I don't think training with multiple would cause a problem, assuming that when you sample a video you also sample one of its text descriptions.
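A hypothetical illustration of that sampling scheme (the names below are made up, not from the R3M codebase):

```python
# When a clip has several captions, draw one at random each time it is sampled.
import random

def sample_pair(clip_ids, clip_to_captions):
    """clip_to_captions maps a clip id to its list of narration strings."""
    clip_id = random.choice(clip_ids)
    caption = random.choice(clip_to_captions[clip_id])
    return clip_id, caption
```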

jasonseu commented 2 years ago

Thanks @MathisClautrier for clarifying this. I haven't used the ego4d CLI, as the internal version of the dataset (and the way it was loaded) was different.

Also, I used a single text description per clip. However, if a clip does match multiple text descriptions, I don't think training with multiple would cause a problem, assuming that when you sample a video you also sample one of its text descriptions.

How long was each video clip in your internal version? In the new version they are all a few minutes long.

suraj-nair-1 commented 2 years ago

I see, ok. The clips I extracted were much shorter, each containing a single short behavior. On average each clip was about 200 frames, so roughly 10 seconds.

jasonseu commented 2 years ago

Ok, I think I need to extract clips manually. Thank you for your patience.