During video inference, only the first frame's features are computed.

PKU-DataLab commented 3 weeks ago

The relevant code is in the file Oryx/oryx/model/oryx_arch.py:

    for idx in range(len(modalities)):
        img_feat_highres, img_size_highres = self.get_model().vision_resampler(highres_img_features[idx],
                                                modalities[idx],
                                                highres_img_sizes[idx])
        img_feat_lowres, img_size_lowres = self.get_model().vision_resampler(lowres_img_features[idx],
                                                modalities[idx],
                                                lowres_img_sizes[idx])
        img_feat = self.get_model().mm_projector(img_feat_lowres,
                                                img_size_lowres,
                                                img_feat_highres,
                                                img_size_highres,
                                                modalities[idx])
        image_features.append(img_feat.flatten(0, 1))
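
To make the concern concrete, here is a minimal sketch of what `flatten(0, 1)` does to a shape. It assumes, without confirmation from the repository, that `img_feat` is laid out as (frames, tokens per frame, hidden size). The helper `flattened_shape` and all sizes are hypothetical and only mimic the shape arithmetic of `Tensor.flatten`.

```python
from math import prod


def flattened_shape(shape, start_dim=0, end_dim=1):
    """Return the shape left after merging dims start_dim..end_dim (inclusive).

    Mirrors the shape arithmetic of Tensor.flatten for non-negative dims only.
    """
    merged = prod(shape[start_dim:end_dim + 1])
    return tuple(shape[:start_dim]) + (merged,) + tuple(shape[end_dim + 1:])


# Hypothetical sizes: 8 frames, 144 tokens per frame, hidden size 1024.
all_frames = flattened_shape((8, 144, 1024))
# What the result would look like if only one frame reached this point.
first_frame_only = flattened_shape((1, 144, 1024))

print(all_frames)
print(first_frame_only)
```

Under that assumption, the first dimension of each entry in `image_features` should equal the number of frames times the tokens per frame. A value equal to a single frame's token count would indicate that only one frame was processed.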

chuyishang commented 3 weeks ago

I'm encountering the same issue.

liuzuyan commented 3 weeks ago

Hi, we have checked the code and found that the shape of the video tokens looks normal. Could you provide more information about this issue?
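
One way to gather the requested information is to record the shape of every entry in `image_features` next to its modality. The following is a rough sketch, assuming each entry is a tensor-like object that exposes `.shape`. `summarize_feature_shapes` is a hypothetical helper, not part of Oryx, and the stand-in objects and sizes are made up for illustration.

```python
from types import SimpleNamespace


def summarize_feature_shapes(image_features, modalities):
    """Return one (index, modality, shape) tuple per feature entry."""
    if len(image_features) != len(modalities):
        raise ValueError("image_features and modalities differ in length")
    return [
        (idx, modality, tuple(feat.shape))
        for idx, (feat, modality) in enumerate(zip(image_features, modalities))
    ]


# Stand-ins for real tensors; only the .shape attribute is needed here.
fake_features = [
    SimpleNamespace(shape=(1152, 1024)),  # hypothetical video entry
    SimpleNamespace(shape=(144, 1024)),   # hypothetical image entry
]
summary = summarize_feature_shapes(fake_features, ["video", "image"])

for idx, modality, shape in summary:
    print(f"entry {idx}: modality={modality}, shape={shape}")
```

Posting this summary together with the number of sampled frames would show whether the token count for a video entry grows with the frame count.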
