Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
https://otter-ntu.github.io/
MIT License

How does syphus for ego4d get the object_description field? #277

Closed by Maxlinn 11 months ago

Maxlinn commented 11 months ago

hi otter team, much thanks for your great work!

recently i'd like to use syphus to produce data for ego4d on my own. i found that the syphus code for fpv references an object_description field, which seems to store short phrases describing the objects present at a timestamp.

but to my knowledge, the ego4d dataset itself does not provide object descriptions per timestamp, just short handwritten narrations. can you give some hint on how you obtained the object descriptions (perhaps from some public models)? and, if you may, would you share them?

much thanks for your kind help!

Luodian commented 11 months ago

Thanks for looking into these details. We process the videos with the following steps.

We cut the videos into frames at 1 FPS, and we only keep windows of 16 consecutive frames that contain at least one annotation.

For example, for a window [x, ..., y], at least one frame inside the range x-y must have an original annotation provided by EGO4D. For the original annotations, we refer to ego4d_data/v2/annotations/all_narrations_redacted.json.
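The windowing step above can be sketched roughly as follows. This is a hypothetical illustration, not the Otter pipeline's actual code: frame extraction (e.g. via ffmpeg) is elided, and we work directly with frame indices and annotation timestamps, assuming frames are sampled at 1 FPS so frame i covers second i.

```python
def keep_windows(num_frames, annotated_times, window=16):
    """Return [start, end) index ranges of `window` consecutive frames
    that contain at least one annotated timestamp.

    Frames are assumed to be sampled at 1 FPS, so a narration at time t
    (in seconds) falls into frame int(t).
    """
    annotated = {int(t) for t in annotated_times}
    kept = []
    for start in range(0, num_frames - window + 1, window):
        # Keep this 16-frame window only if it covers an annotation.
        if any(i in annotated for i in range(start, start + window)):
            kept.append((start, start + window))
    return kept
```

A window with no narration inside it is dropped entirely, so every kept clip is guaranteed to have at least one text annotation to pair with.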

For the kept frames, we use the BLIP-2 and GRIT models to caption and detect the objects inside each frame.
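Structurally, the per-frame step looks something like the sketch below. The model wrappers are passed in as callables so the shape of the output can be shown without loading any weights; the `captioner`/`detector` signatures are assumptions, standing in for BLIP-2 (caption) and GRIT (object descriptions plus boxes).

```python
def annotate_frame(image, captioner, detector):
    """Return (caption, object_descriptions, boxes) for one frame.

    Assumed interfaces (hypothetical):
      captioner(image) -> str                          # e.g. a BLIP-2 wrapper
      detector(image)  -> list[(str, [x1, y1, x2, y2])]  # e.g. a GRIT wrapper
    """
    caption = captioner(image)
    detections = detector(image)
    # Split the detector output into parallel lists, matching the
    # object_description / boxes fields of the annotation file.
    object_description = [desc for desc, _ in detections]
    boxes = [box for _, box in detections]
    return caption, object_description, boxes
```

Keeping `object_description` and `boxes` as parallel lists (rather than nesting them) matches the flat JSON layout shown below.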

We then generate an auxiliary annotation file that looks like the following:

    "8d928865-5d5f-4b10-b1cb-ef439c5c8ecd": {
        "clips": [
            {
                "narrations": [
                    {
                        "time": 0.0,
                        "text": "The cameraman holds a lace cloth in her hands",
                        "object_description": [
                            "the man is wearing a black shirt",
                            "a black computer keyboard"
                        ],
                        "boxes": [
                            [
                                67,
                                142,
                                353,
                                287
                            ],
                            [
                                0,
                                0,
                                42,
                                119
                            ]
                        ],
                        "width": 298,
                        "height": 224
                    },
                    {
                        "time": 0.0,
                        "text": "The cameraman holds a piece of lace in her hands",
                        "object_description": [
                            "the man is wearing a black shirt",
                            "a black computer keyboard"
                        ],
                        "boxes": [
                            [
                                67,
                                142,
                                353,
                                287
                            ],
                            [
                                0,
                                0,
                                42,
                                119
                            ]
                        ],
                        "width": 298,
                        "height": 224
                    },
                    {
                        "time": 0.51466,
                        "text": "The cameraman looks at the piece of lace",
                        "object_description": [
                            "the man is wearing a black shirt",
                            "a black computer keyboard"
                        ],
                        "boxes": [
                            [
                                67,
                                142,
                                353,
                                287
                            ],
                            [
                                0,
                                0,
                                42,
                                119
                            ]
                        ],
                        "width": 298,
                        "height": 224
                    },
                    ...

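Assembling one entry in that format can be sketched as below. The field names mirror the example above; the helper function itself is hypothetical and only illustrates the nesting (video id → clips → narrations).

```python
import json

def narration_record(time, text, object_description, boxes, width, height):
    """Build one narration entry in the auxiliary annotation format."""
    return {
        "time": time,
        "text": text,
        "object_description": object_description,
        "boxes": boxes,
        "width": width,
        "height": height,
    }

# One video id maps to a list of clips, each holding its narrations.
aux = {
    "8d928865-5d5f-4b10-b1cb-ef439c5c8ecd": {
        "clips": [
            {
                "narrations": [
                    narration_record(
                        0.0,
                        "The cameraman holds a lace cloth in her hands",
                        ["the man is wearing a black shirt",
                         "a black computer keyboard"],
                        [[67, 142, 353, 287], [0, 0, 42, 119]],
                        298, 224,
                    ),
                ],
            },
        ],
    },
}

print(json.dumps(aux, indent=4))
```

Each narration carries its own copy of the frame's object descriptions and boxes, which is why consecutive narrations at the same timestamp repeat the same detections in the example above.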
Please correct me if the used detection/caption models are wrong. @ZhangYuanhan-AI

Maxlinn commented 11 months ago

sorry for the late response, much thanks for sharing!