Question on the MOSEI Annotation Procedure and Alignment With Text

Helw150 commented 5 months ago

Hi All!

Background

I'm currently looking at the data in MOSEI and finding a fair number of instances where the transcript provided by the SDK doesn't match the audio for the given video and timestamps.

For example, for video y9FyTEyGy5Y (still available here: https://www.youtube.com/watch?v=y9FyTEyGy5Y), the transcript provided by the SDK for the first 10 seconds of the video is:

"""
sp welcome back sp everybody sp the sp number of students foreign exchange sp grows sp every year sp last year sp it was up sp from the previous year sp while sp they sp receive sp financial scholarships most restrict them what sp is sp the sp appeal sp of sp studying sp abroad sp we have a student
"""

If I normalize that by hand and remove the "sp"s, I get this:

"""
Welcome back everybody, the number of students foreign exchange grows every year. Last year it was up from the previous year. While they receive financial scholarships, most restrict them. What is the appeal of studying abroad? We have a student...
"""
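The mechanical part of that normalization (dropping the "sp" markers, not restoring punctuation or casing) can be sketched in a few lines. This is only an illustration: the helper name `strip_pause_tokens` is made up here, and treating "sp" as a standalone pause marker is an assumption based on how it appears in the transcript above.

```python
def strip_pause_tokens(raw: str, pause_token: str = "sp") -> str:
    """Drop standalone pause markers and collapse runs of whitespace.

    Assumption: `pause_token` only ever appears as its own
    whitespace-separated token, as in the SDK excerpt above.
    """
    words = [w for w in raw.split() if w != pause_token]
    return " ".join(words)


raw = "sp welcome back sp everybody sp the sp number of students"
print(strip_pause_tokens(raw))
```

Note that a genuine spoken word "sp" would be dropped as well, so this is only safe if the marker never collides with real vocabulary.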

But this is what is actually said in the first 10 seconds (transcribed by hand):

"""
Welcome back everybody, the number of international students attending a US college or university continues to grow every year, but the financial aid options available to those undergraduates varies amongst schools.
"""

My best guess is that the transcript in the dataset was created live for the show: it captures the gist of the content, but it differs significantly from what was actually said and has some big gaps in the timestamps.

Question

In cases such as this, where the provided transcription isn't well aligned, I'm trying to figure out whether it is reasonable to re-transcribe directly from the audio or whether there may be some misalignment in the label itself.

Were these transcripts provided to the annotators when they labeled the emotion and sentiment? Or did they annotate purely by watching the video?

If the latter, re-transcription seems a reasonable path forward for my usage, but if annotators saw the transcript I'm not quite sure how to proceed!

smudge1872 commented 3 months ago

I've found this happens quite often too. It seems to happen in videos that are more than 2 minutes long. I wonder if they cropped a section of the video and the timestamps given in the labels are relative to that cropped section.

Here is an example from an 11-minute video (https://www.youtube.com/watch?v=-9YyBTjo1zo), ID = -9YyBTjo1zo. From parsing the labels, my reading is that from 43 seconds to 48 seconds he should be saying "Republicans are pushing a new bill that is in many ways more radical than previous bills. The new bill". But he does not say this until timestamp 7:19. I guess if we want to find the original audio/video segment, we have to sync it up with the transcript that YouTube provides.
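One way to do that syncing is to slide the SDK phrase over the caption text and keep the best fuzzy match. The sketch below uses only the standard library's `difflib`. It assumes the captions are already available as `(start_seconds, text)` pairs (how they are fetched is left out), and the function name `locate_phrase`, the 0.6 threshold, and the caption lines in the example are all made up for illustration rather than real YouTube data.

```python
from difflib import SequenceMatcher


def locate_phrase(phrase, captions, min_ratio=0.6):
    """Find where `phrase` is most likely spoken.

    captions: list of (start_seconds, text) pairs (assumed input format).
    Returns (start_seconds_of_caption_line, match_ratio), or None if
    nothing reaches `min_ratio`.
    """
    target = phrase.lower().split()
    # Flatten captions into (word, start time of the line it came from).
    words = [(w, start) for start, text in captions for w in text.lower().split()]
    if not target or not words:
        return None

    best = None
    n = len(target)
    for i in range(max(1, len(words) - n + 1)):
        window = [w for w, _ in words[i:i + n]]
        ratio = SequenceMatcher(None, target, window).ratio()
        if best is None or ratio > best[1]:
            best = (words[i][1], ratio)
    return best if best[1] >= min_ratio else None


# Illustrative caption lines only; the timings are placeholders.
captions = [
    (437.0, "so let's talk about what is happening"),
    (439.0, "republicans are pushing a new bill that is"),
    (442.0, "in many ways more radical than previous bills"),
]
phrase = ("Republicans are pushing a new bill that is "
          "in many ways more radical than previous bills")

match = locate_phrase(phrase, captions)
if match is not None:
    sdk_start = 43.0  # start time taken from the labels, per the example above
    print(f"best match at {match[0]}s, offset {match[0] - sdk_start}s")
```

The result is only as precise as the caption line boundaries, and it will not help where the SDK text and the captions differ in wording rather than in timing.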

Helw150 commented 3 months ago

@abwilf @lpmorency Is there any way offsets or any other info for recreating the raw audio signal could be added to the SDK?

I totally get why you can't distribute the audio itself. However, providing offsets would be invaluable for using MOSEI to test new speech and audio models that consume audio directly, rather than the features distributed at present!

If there's any way I can help in this, I'd be happy to!

Helw150 commented 3 months ago

I don't know if what I'm seeing is fully explainable by offsets alone, since the transcript from the SDK doesn't seem to match what I get from YouTube.
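A rough way to separate the two cases (pure offset versus different wording) is to score the word-level similarity of the two texts once they cover the same stretch of speech. The sketch below compares the opening sentence from the SDK with the hand transcription quoted earlier in this thread. `word_similarity` is a hypothetical helper, and the score is a heuristic, not a standard alignment metric.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase and keep only word-like tokens, so punctuation is ignored."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_similarity(a, b):
    """Similarity in [0, 1] over word sequences; 1.0 means identical words."""
    return SequenceMatcher(None, tokenize(a), tokenize(b)).ratio()


sdk_text = ("Welcome back everybody, the number of students "
            "foreign exchange grows every year.")
heard_text = ("Welcome back everybody, the number of international students "
              "attending a US college or university continues to grow every year")

score = word_similarity(sdk_text, heard_text)
print(f"word-level similarity: {score:.2f}")
```

A score near 1.0 after shifting by some offset would point to a timing problem. A low score even at the best offset suggests the SDK text is a different transcription altogether.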


CMU-MultiComp-Lab / CMU-MultimodalSDK

Question on the MOSEI Annotation Procedure and Alignment With Text #18
