facebookresearch / r3m

Pre-training Reusable Representations for Robotic Manipulation Using Diverse Human Video Data
https://sites.google.com/view/robot-r3m/
MIT License

Ego4d preprocessing steps #13

Closed · liruiw closed this 2 years ago

liruiw commented 2 years ago

First, congratulations on your paper and thanks for sharing the codebase! I have recently been studying some of the implementation details and trying to reproduce the R3M training on Ego4D. I am wondering whether there are preprocessing steps for the Ego4D dataset that have not been released. For instance, the paper suggests splitting the videos into individual frames (are the videos used a subset of the full_scale Ego4D?), and the codebase expects 'len' and 'txt' keys in the manifest file (these fields seem to be missing). It would be great if you could hint at how to preprocess the dataset or point me to related code that I might have missed. Thanks!

suraj-nair-1 commented 2 years ago

Hello, thanks for your interest in the work! Yes, because of some differences in format between the old internal and the current external versions of the dataset, the pre-processing steps for the public dataset will be different. The steps are:

liruiw commented 2 years ago

Thanks for your reply! Just to confirm: by "annotation text", do you mean the 'summary_text' in narration.json for each video?

suraj-nair-1 commented 2 years ago

Hi, I think you may actually want 'narration_text'. It should be just a single sentence or so of what is happening in the short clip like "C is chopping the tomatoes" or "C wiping the window with the rag".

liruiw commented 2 years ago

I see. I think we are referring to the same text, but just named differently in different versions. Thanks!
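For concreteness, a minimal sketch of what the last preprocessing step could look like (not the authors' code): it writes a manifest with the 'len' and 'txt' fields mentioned in the first post, assuming each clip's frames have already been extracted into their own folder; the 'path' column name and the CSV layout are assumptions.

```python
# Hypothetical manifest builder, not the official R3M preprocessing.
# Assumes <frames_root>/<clip_id>/*.jpg already exist and that `captions`
# maps clip_id -> its narration_text.
import csv
import os

def write_manifest(frames_root, captions, manifest_path="manifest.csv"):
    rows = []
    for clip_id, txt in captions.items():
        frame_dir = os.path.join(frames_root, clip_id)
        if not os.path.isdir(frame_dir):
            continue
        n_frames = len([f for f in os.listdir(frame_dir) if f.endswith(".jpg")])
        # 'len' = number of frames in the clip, 'txt' = its narration text
        rows.append({"path": frame_dir, "len": n_frames, "txt": txt})
    with open(manifest_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["path", "len", "txt"])
        writer.writeheader()
        writer.writerows(rows)
```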

taokong commented 2 years ago
  • Use the public Ego4D CLI/tools to download only the language annotated clips from Ego4D (the canonical clips).

Hi @suraj-nair-1 @liruiw, could you explain how to download only the language annotated clips? Thanks in advance.

liruiw commented 2 years ago

I think I downloaded the full-scale version from the official website with language annotations.

jasonseu commented 2 years ago

Hi, I think you may actually want 'narration_text'. It should be just a single sentence or so of what is happening in the short clip like "C is chopping the tomatoes" or "C wiping the window with the rag".

Hi, each 'narration_text' in narration.json for a video is associated only with a 'timestamp_frame' and a 'timestamp_sec'. Without an 'end_frame' or 'end_sec', how can we get a video clip to pair with the 'narration_text'?

suraj-nair-1 commented 2 years ago

If you are using the canonical clips, each clip should be a short video, and the text should refer to the entire clip.

jasonseu commented 2 years ago

If you are using the canonical clips, each clip should be a short video, and the text should refer to the entire clip.

I see. Do you mean downloading clips using ego4d --output_directory="~/ego4d_data" --datasets clips annotations? That way, I got 12283 video clips. However, the file narration.json contains the annotations for the full_scale videos (screenshot of its contents omitted). I am still confused about how to associate each 'narration_text' in narration.json with the corresponding video clip.
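One quick way to inspect narration.json locally (standing in for the screenshot above); the path here assumes the ego4d CLI's default v1 layout and may differ on your machine:

```python
# Peek at narration.json (in place of the screenshot above).
# The path assumes the ego4d CLI's default layout; adjust as needed.
import json

with open("ego4d_data/v1/annotations/narration.json") as f:
    narrations = json.load(f)

video_uid = next(iter(narrations))  # an arbitrary full_scale video uid
print(video_uid)
print(json.dumps(narrations[video_uid], indent=2)[:1500])  # first part of its fields
```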

MathisClautrier commented 2 years ago

I had the same problems when trying to download the clips directly. I chose to download the full_scale videos and extract the clips myself. You can merge the captions (narration.json) by timestamp_sec (there can be multiple captions for a single clip) and then extract the clips sequentially. Doing this, though, you will end up with multiple captions for some clips. From a personal inspection of the resulting clips, this process seems satisfactory and yields many clips (I chose not to extract clips that were too long or too short).
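A rough sketch of this approach (not the official R3M pipeline): treat consecutive narration timestamps as clip boundaries, skip segments that are too short or too long, and cut them with ffmpeg. The simplified narration layout and the length bounds below are assumptions to adapt to the actual file.

```python
# Sketch of the clip-extraction idea above; not the official R3M pipeline.
# Assumes `narrations` is a list of dicts with 'timestamp_sec' and
# 'narration_text' for one full_scale video, and that ffmpeg is on PATH.
import subprocess

MIN_SEC, MAX_SEC = 2.0, 30.0  # illustrative bounds for "too short" / "too long"

def extract_clips(narrations, video_path, out_prefix):
    narrations = sorted(narrations, key=lambda n: n["timestamp_sec"])
    pairs = []
    for i in range(len(narrations) - 1):
        start = narrations[i]["timestamp_sec"]
        duration = narrations[i + 1]["timestamp_sec"] - start
        if not (MIN_SEC <= duration <= MAX_SEC):
            continue
        out_file = f"{out_prefix}_{i:05d}.mp4"
        # Stream copy is fast but snaps to keyframes; re-encode for frame-accurate cuts.
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(start), "-i", video_path,
             "-t", str(duration), "-c", "copy", out_file],
            check=True,
        )
        pairs.append((out_file, narrations[i]["narration_text"]))
    return pairs  # (clip_path, caption) pairs
```

Each clip here is paired with the narration at its start timestamp; narrations that fall inside the same window would have to be merged into multiple captions for that clip, as described above.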

jasonseu commented 2 years ago

I had the same problems when trying to download the clips directly. I chose to download the full_scale videos and extract the clips myself. You can merge the captions (narration.json) by timestamp_sec (there can be multiple captions for a single clip) and then extract the clips sequentially. Doing this, though, you will end up with multiple captions for some clips. From a personal inspection of the resulting clips, this process seems satisfactory and yields many clips (I chose not to extract clips that were too long or too short).

Thanks for your reply. If I understand correctly, the clips yielded by your approach may correspond to multiple text descriptions. This leads to a coarser match between language and video, which may hinder the learning of video-language alignment in the R3M model.

MathisClautrier commented 2 years ago

Only a fraction of the clips will have multiple text descriptions (when the actor is doing multiple things simultaneously), and you can choose to leave those out. I don't know whether including them would hinder learning the video-language alignment, since a given video may genuinely correspond to several actions; I haven't tried using such clips, so I can't say whether it affects learning.

suraj-nair-1 commented 2 years ago

Thanks @MathisClautrier for clarifying this. I haven't used the ego4d CLI, as the internal version of the dataset (and the way it was loaded) was different.

Also, I used a single text description per clip. However, if a clip does match multiple text descriptions, I don't think training with multiple would cause a problem, assuming that when you sample a video you also sample one of its text descriptions.
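A hypothetical illustration of that sampling scheme (the names below are made up, not from the R3M codebase):

```python
# When a clip has several captions, draw one at random each time it is sampled.
import random

def sample_pair(clip_ids, clip_to_captions):
    """clip_to_captions maps a clip id to its list of narration strings."""
    clip_id = random.choice(clip_ids)
    caption = random.choice(clip_to_captions[clip_id])
    return clip_id, caption
```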

jasonseu commented 2 years ago

Thanks @MathisClautrier for clarifying this. I haven't used the ego4d CLI, as the internal version of the dataset (and the way it was loaded) was different.

Also, I used a single text description per clip. However, if a clip does match multiple text descriptions, I don't think training with multiple would cause a problem, assuming that when you sample a video you also sample one of its text descriptions.

How long was each video clip in your internal version? In the new version they are all a few minutes long.

suraj-nair-1 commented 2 years ago

I see, ok. The clips I extracted were much shorter, each containing a single short behavior. On average each clip was about 200 frames, so roughly 10 seconds.

jasonseu commented 2 years ago

Ok, I think I need to extract clips manually. Thank you for your patience.