Closed JuanS286 closed 2 weeks ago
A model was defined to get a set of frames from the GIF's, give them to BLIP model which is going to generate one caption for each frame. Next, the group of captions related to every single gif are going to T5 to aggregate them and generate the final output.
Create a model that trains both T5 and BLIP, the last one with backpropagation of the T5 cost function