NickyFot / EmoCLIP


How did you generate the sample-level description (the ones in Figure 1)? #1

Closed · JethroJames closed this issue 10 months ago

JethroJames commented 10 months ago

Hello,

I hope you're doing well. I recently came across your work and find it truly intriguing. I was wondering how it differs from the research presented at BMVC 2023, which can be found here.

Specifically, I'm curious about the sample-level description generation technique you've introduced. The paper doesn't seem to provide an intuitive explanation of this. Is it possible for GPT to generate a coherent description directly from a video segment? Could you shed some light on how the sample-level descriptions were generated? I'd greatly appreciate any insights you can share.

Thank you in advance for your time and clarification!

Best regards

NickyFot commented 10 months ago

Hi,

Thanks for taking the time to read the paper! The sample-level descriptions in our work are not generated; we directly use the ones provided by the MAFW dataset, which I believe were obtained from human annotators: https://github.com/MAFW-database/MAFW
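For reference, once you have the MAFW annotation files, reading the captions is straightforward. Here is a minimal sketch, assuming the labels have been exported to a CSV with `video_name` and `caption` columns (the actual files in the MAFW release may use a different format and headers, so adjust accordingly):

```python
# Minimal sketch: loading MAFW sample-level descriptions.
# Assumes a CSV export with "video_name" and "caption" columns;
# the real MAFW label files may be laid out differently.
import pandas as pd


def load_mafw_captions(csv_path: str) -> dict[str, str]:
    """Map each clip name to its human-annotated description."""
    df = pd.read_csv(csv_path)
    return dict(zip(df["video_name"], df["caption"]))


captions = load_mafw_captions("mafw_labels.csv")  # hypothetical path
print(captions.get("00001.mp4", "no caption found"))
```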

Thanks

Niki