facebookresearch / ImageBind

ImageBind: One Embedding Space to Bind Them All

Splitting VISION into text and video #81

Open kilowrk opened 1 year ago

kilowrk commented 1 year ago

Hi,

I have worked out how to use the different data loaders to compare images and videos, but I cannot find a way to do both at the same time. As far as I can tell (my understanding is limited), ModalityType.VISION lets you load and transform either videos or still images at one time, not both. I think splitting VISION into two modality types would make this easier to use. Can anyone point me to a way to do this?
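
One possible workaround, sketched here under assumptions rather than taken from the ImageBind code: run one forward pass per input kind (still images, then videos), keep the two sets of embeddings apart, and compare them afterwards. The helper names (`embed_separately`, `similarity_matrix`) and the two stand-in encoders are hypothetical; in a real setup each encoder would presumably wrap the matching ImageBind loader and a model call keyed by ModalityType.VISION.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Encoder = Callable[[Sequence[str]], List[Vector]]


def embed_separately(encoders: Dict[str, Encoder],
                     inputs: Dict[str, Sequence[str]]) -> Dict[str, List[Vector]]:
    """Run one pass per input kind and collect the embeddings by kind."""
    return {kind: encoders[kind](paths) for kind, paths in inputs.items() if paths}


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_matrix(rows: List[Vector], cols: List[Vector]) -> List[List[float]]:
    """Pairwise cosine similarities: one row per image, one column per video."""
    return [[cosine_similarity(r, c) for c in cols] for r in rows]


# Hypothetical stand-ins so the sketch runs without ImageBind installed.
# Assumption: a real image encoder would call the image loader and then the
# model with {ModalityType.VISION: tensor}; the video encoder would do the
# same with the video loader, in a second, separate pass.
def fake_image_encoder(paths: Sequence[str]) -> List[Vector]:
    return [[1.0, 0.0] for _ in paths]


def fake_video_encoder(paths: Sequence[str]) -> List[Vector]:
    return [[1.0, 1.0] for _ in paths]


embeddings = embed_separately(
    {"image": fake_image_encoder, "video": fake_video_encoder},
    {"image": ["dog.jpg"], "video": ["dog.mp4", "car.mp4"]},
)
scores = similarity_matrix(embeddings["image"], embeddings["video"])
```

Because both passes would go through the same vision encoder, the resulting embeddings should live in the same space, so comparing them directly is plausible. This does not split VISION into two modality types; it only keeps the two input kinds in separate batches.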

LinB203 commented 10 months ago

Hi, we would like to recommend our work, LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. We split vision into separate video and image modalities, so you can input either an image or a video.