I want to use the pre-trained MMVID model with celebvhq-text, so I would like to know how long the text sequence should be for a given number of frames. Is it the same as the MMVID training config used for MM-Vox (frames_num = 8, text sequence = 50)?
Also, in the paper, the text descriptions contain all of the information (action, face attributes, emotion, etc.), but you uploaded them separately. Could you let us know how to integrate them into a single description, and which sentence belongs to which frames?
Thank you.