Egoinstructor

Official PyTorch implementation of Egoinstructor (CVPR 2024)

Retrieval-Augmented Egocentric Video Captioning
Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

Given an egocentric video, Egoinstructor automatically retrieves semantically relevant instructional videos (e.g., from HowTo100M) via a pretrained cross-view retrieval model, then leverages their visual and textual information to generate a caption for the egocentric video.
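
The retrieval step can be pictured as a nearest-neighbour search in a shared embedding space. The sketch below is only an illustration under stated assumptions, not the code in this repository: the function names `cosine_similarity` and `retrieve_top_k` are hypothetical, the three-dimensional vectors are toy stand-ins for features that a cross-view retrieval model would produce, and cosine similarity is assumed as the ranking score.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    ego_embedding: Sequence[float],
    exo_embeddings: Sequence[Sequence[float]],
    k: int = 2,
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k instructional videos closest to the ego video.

    Hypothetical helper: the embeddings are assumed to live in one shared space.
    """
    scored = [
        (index, cosine_similarity(ego_embedding, exo))
        for index, exo in enumerate(exo_embeddings)
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: one egocentric clip and a bank of three instructional clips.
ego = [1.0, 0.0, 0.0]
exo_bank = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.0, 0.7],
]

top = retrieve_top_k(ego, exo_bank, k=2)
for index, score in top:
    print(f"instructional video {index}: similarity {score:.3f}")
```

In a captioning pipeline of this kind, the videos and narrations behind the returned indices would then be passed to the captioning model as additional context.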

Roadmap

[x] Retrieval code and data released
[x] Captioning code and data released
[ ] Online Demo
[x] Pre-trained retrieval checkpoints
[ ] Pre-trained captioning checkpoints

Prepare environment

Please refer to env.md

Cross-view Retrieval Module

To train an ego-exo cross-view retrieval module, please refer to retrieval.

Retrieval-augmented Captioning

To train a retrieval-augmented egocentric video captioning model, please refer to captioning.

Citation

If this work is helpful for your research, please consider citing us.

@article{xu2024retrieval,
  title={Retrieval-augmented egocentric video captioning},
  author={Xu, Jilan and Huang, Yifei and Hou, Junlin and Chen, Guo and Zhang, Yuejie and Feng, Rui and Xie, Weidi},
  journal={arXiv preprint arXiv:2401.00789},
  year={2024}
}

License

This project is released under the MIT License.

Acknowledgements

This project is built upon LaViLA and Otter. Thanks to the contributors of these great codebases.
