You can also find the pretraining code in the Embodied Family repository.
You can also download the test model, Embodied_family_7Btiny.
Welcome to the official GitHub repository for the EgoCOT Dataset! This dataset is designed to address the challenges in embodied planning by providing a large-scale collection of egocentric videos and corresponding step-by-step planning instructions. We have also extended the dataset to include the EgoVQA dataset, focusing on egocentric human-object interaction video question answering tasks.
In this README, we will introduce the key features of EgoCOT, its construction process, and the possibilities it offers for various embodied tasks.
The construction of EgoCOT involves a meticulous process to ensure the quality and relevance of the data. Here is an overview of the steps involved:
Selection of egocentric videos: We carefully curated a subset of egocentric videos from the Ego4D dataset to form the foundation of EgoCOT. These videos cover a wide range of real-world scenarios and capture diverse embodied experiences.
Machine-generated planning instructions: We employed state-of-the-art machine learning techniques to generate initial planning instructions for each video. These instructions serve as a starting point for the subsequent filtering and verification steps.
Semantics-based filtering: To enhance the quality of the planning instructions, we applied a semantics-based filtering mechanism. This process helps ensure that the instructions are accurate, meaningful, and aligned with the video content.
Human verification: To guarantee the correctness and clarity of the planning instructions, we engaged human annotators to review and verify each instruction. This step helps eliminate any remaining errors or ambiguities, resulting in a reliable dataset.
EmbodiedGPT is an innovative multi-modal model developed based on the EgoCOT and EgoVQA datasets. This model provides an end-to-end solution for various embodied tasks, allowing natural and intuitive interaction with the physical world.
Key tasks that EmbodiedGPT can perform include embodied planning, egocentric video question answering, and video captioning.
This dataset contains examples of video data, where each sample consists of a sequence of eight consecutive frames represented as numpy arrays, along with associated captions and embodied planning information. The dataset is intended for tasks related to video analysis, captioning, and embodied planning research. The goal of this dataset is to provide a resource for evaluating the alignment between video data and embodied planning descriptions.
Each data sample in the dataset is represented in JSON format and has the following fields:
image: The file name of the numpy array file containing the sequence of eight consecutive frames that make up the video clip. These frames can be used to reconstruct the clip for analysis.
caption: A brief caption describing the content of the video.
planning: The embodied planning information associated with the video: a series of actions required to achieve a specific goal in the given video context. The actions are given as a single string, each listed with a corresponding step number, in the following format:
pick up the bag of clothes. Put the bag of clothes on the floor.
actions:
1. pick up(bag of clothes)
2. put on(bag of clothes, floor)
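As a minimal sketch of how such a planning string could be split back into structured actions (the `parse_planning` helper and its regular expression are our own illustration, not part of the dataset tooling, and assume exactly the format shown above):

```python
import re

def parse_planning(planning: str):
    """Split a planning string into its goal text and a list of numbered actions.

    Assumes the format shown above: free-text goal sentences, then a line
    'actions:' followed by lines like '1. pick up(bag of clothes)'.
    Returns (goal, [(verb, [arguments]), ...]).
    """
    goal, _, action_block = planning.partition("actions:")
    actions = []
    for line in action_block.strip().splitlines():
        # Match 'N. verb(arg1, arg2, ...)'
        m = re.match(r"\d+\.\s*(\w[\w ]*?)\((.*)\)", line.strip())
        if m:
            verb = m.group(1).strip()
            args = [a.strip() for a in m.group(2).split(",")]
            actions.append((verb, args))
    return goal.strip(), actions
```

For the example above this would yield the goal sentence plus `[("pick up", ["bag of clothes"]), ("put on", ["bag of clothes", "floor"])]`.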
score: A numeric score indicating the alignment between the video and the embodied planning; it measures how well the planning description matches the actual content of the video. Any data samples with a score lower than 0.2 were removed during the data-cleaning process.
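The score-based cleaning step can be reproduced with a one-line filter (a sketch of our own; the dataset itself is already distributed with the threshold applied):

```python
def filter_by_score(samples, threshold=0.2):
    """Keep only samples whose planning/video alignment score meets the threshold.

    `samples` is a list of dicts in the record format described above;
    records missing a 'score' field are treated as score 0.0 and dropped.
    """
    return [s for s in samples if s.get("score", 0.0) >= threshold]
```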
Below is an example of a single data sample in the dataset:
{
"image": "EGO_1.npy",
"caption": "C places the bag of clothes on the floor",
"planning": "pick up the bag of clothes. Put the bag of clothes on the floor.\nactions:\n1. pick up(bag of clothes)\n2. put on(bag of clothes, floor)",
"score": 0.268310546875
}
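A sample like the one above can be loaded along with its frame array as follows (a hedged sketch: the `load_sample` helper and the `data_root` layout are our own assumptions, with the `image` field taken to name an array of shape `(8, H, W, C)` stored next to the annotation file):

```python
import json
import os

import numpy as np

def load_sample(json_path, data_root="."):
    """Load one annotation record and its eight-frame clip.

    Assumes the record layout shown above, with the 'image' field naming
    a .npy file under `data_root` that holds eight consecutive frames.
    """
    with open(json_path) as f:
        sample = json.load(f)
    frames = np.load(os.path.join(data_root, sample["image"]))
    assert frames.shape[0] == 8, "each clip should contain eight frames"
    return sample, frames
```

The caption, planning string, and score are then available as ordinary dictionary fields on `sample`, while `frames` can be fed directly to a video model or written back out for visualization.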
This dataset is made available for research purposes only. Researchers and developers can utilize this dataset to evaluate and benchmark algorithms and models related to video analysis, captioning, and embodied planning. However, users are required to cite the source of the dataset appropriately in their publications or works.
If you find this project useful in your research, please consider citing:
@article{anonymousembodiedgpt,
author = {Anonymous},
title = {EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought},
journal = {Under Review},
year = {2023},
}