jli262 closed this issue 11 months ago
Using --no_speech does not feed any tokens to the T5 encoder. If you use a model that has been trained with speech, you instead need to provide empty subtitles, which gives the T5 encoder an EOS token as input. As you can see from the results in the VidChapters paper, speech is the most important modality for the current video chapter generation models, so similar behavior may well appear on the downstream tasks.
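For illustration, here is a minimal sketch of the "empty subtitles" suggestion, assuming a standard HuggingFace T5 tokenizer (the exact tokenizer checkpoint name below is an assumption, not taken from this thread): an empty subtitle string is encoded as a single EOS token, which is what the T5 encoder then receives as its speech input.

```python
# Sketch only: tokenizing an empty subtitle string with a HuggingFace T5
# tokenizer yields just the EOS token. The "t5-base" name is an assumption
# made for illustration; use whichever tokenizer the checkpoint was trained with.
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

empty_subtitle = ""  # a video with no usable ASR
encoded = tokenizer(empty_subtitle, return_tensors="pt")
print(encoded.input_ids)  # tensor([[1]]) -> a single EOS token (id 1 for T5)
```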
Understood. However, for inference on ActivityNet and Charades, the quality of the ASR features is quite poor.
Is it possible to import the checkpoints of the Vid2Seq model from your previous Vid2Seq paper, with little modification, into this PyTorch inference script to do the dense video captioning task? And should the T5 also be changed to T5 v1.1?
It should be possible -- you would just have to make sure that the config matches that of T5 v1.1. I am also not sure how well switching from Google ASR to Whisper without retraining would work.
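As a rough sketch of what "the config matches T5 v1.1" means (the HuggingFace checkpoint names below are assumptions for illustration, not taken from this thread), T5 v1.1 uses a gated-GELU feed-forward layer and untied input/output embeddings, so the imported checkpoint's text-backbone config should agree with those settings:

```python
# Sketch only: compare an original T5 config with a T5 v1.1 config to see the
# settings a checkpoint trained on T5 v1.1 would expect.
from transformers import T5Config

t5_original = T5Config.from_pretrained("t5-base")
t5_v1_1 = T5Config.from_pretrained("google/t5-v1_1-base")

print(t5_original.feed_forward_proj, t5_v1_1.feed_forward_proj)      # relu vs gated-gelu
print(t5_original.tie_word_embeddings, t5_v1_1.tie_word_embeddings)  # True vs False
```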
Hi! May I know how to do inference without speech?
I've set --no_speech, but then the output is []. And when I do inference on the ActivityNet and Charades datasets, the output looks like it only considers the speech feature.
Thank you!