Hi! I am trying to reproduce your dataset, but I have a question about generating the utterances. Is there any metadata that records each utterance's start and end time in the video? Or are there any tricks for handling utterances?