TheoCoombes / ClipCap

Using pretrained encoder and language models to generate captions from multimedia inputs.
94 stars, 15 forks