X-LANCE / SLAM-LLM

Speech, Language, Audio, Music Processing with Large Language Model
MIT License

what's the model's input #115

Closed xiayu-cell closed 1 month ago

xiayu-cell commented 1 month ago

Hello, I read the dataset code and saw that the input is the whole audio plus one image's OCR result. Did I understand that correctly?

ddlBoJack commented 1 month ago

Did you mean the Slidespeech dataset?

xiayu-cell commented 1 month ago

yes

xiayu-cell commented 1 month ago

For the input, is the OCR result of all frames put into the prompt, or only one frame's OCR result?

yanghaha0908 commented 1 month ago

"for each speech segment, we extract the middle frame image and apply TD and OCR models from the MMOCR toolkit to extract the words in the slide." [1]

[1] SLIDESPEECH: A LARGE SCALE SLIDE-ENRICHED AUDIO-VISUAL CORPUS

To answer your question: yes, the input is the whole audio of a segment plus one image's OCR result. Only one frame's OCR result is put into the prompt.
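Concretely, the per-segment prompt assembly described above might look like the sketch below. The helper names and the prompt template are hypothetical illustrations, not taken from the SLAM-LLM codebase; only the idea (middle frame of each segment, its OCR words injected into the prompt) comes from the thread.

```python
# Hedged sketch: assemble one segment's prompt from its middle-frame OCR words.
# All names and the template string here are made up for illustration.

def middle_frame_time(seg_start: float, seg_end: float) -> float:
    """Timestamp (seconds) of the middle frame of a speech segment."""
    return (seg_start + seg_end) / 2.0


def build_prompt(ocr_words: list[str]) -> str:
    """Insert one frame's OCR words into an ASR prompt (illustrative template)."""
    keywords = " ".join(ocr_words)
    return f"Transcribe speech to text. Keywords from the slide: {keywords}"


# Example: a segment spanning 12.0-18.0 s would use the frame at 15.0 s,
# and that frame's OCR output would be formatted into the prompt.
t = middle_frame_time(12.0, 18.0)
prompt = build_prompt(["SlideSpeech", "MMOCR", "VAD"])
```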

xiayu-cell commented 1 month ago

Did you split the audio into parts corresponding to different frame images? If not, each inference just takes the whole audio and one frame's OCR result as input, so what is the output? The whole subtitle, or only part of the video?

yanghaha0908 commented 1 month ago

No. The output is the ASR result of the input speech segment, a part of the whole video.

xiayu-cell commented 1 month ago

Could you share how you split the video or audio?

yanghaha0908 commented 1 month ago

The audio from all videos is segmented using VAD, and the ASR system is then utilized to generate candidate transcripts for each segment. Thus, we obtain the audio/text segments. [1]

[1] SLIDESPEECH: A LARGE SCALE SLIDE-ENRICHED AUDIO-VISUAL CORPUS
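As a toy illustration of the VAD segmentation step, here is a minimal energy-threshold version in pure Python. The SlideSpeech pipeline uses a real VAD system, so the function name, frame size, and threshold below are all assumptions for demonstration only.

```python
# Hedged sketch: energy-based voice activity detection (illustrative only;
# SlideSpeech uses a production VAD, not this toy threshold rule).

def vad_segments(samples, sr, frame_ms=30, threshold=0.01, min_frames=2):
    """Return (start_s, end_s) spans whose per-frame energy exceeds threshold."""
    frame_len = int(sr * frame_ms / 1000)
    voiced = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        voiced.append(energy > threshold)

    # Merge consecutive voiced frames into (start, end) segments in seconds.
    segments, start = [], None
    for idx, v in enumerate(voiced):
        if v and start is None:
            start = idx
        elif not v and start is not None:
            if idx - start >= min_frames:
                segments.append((start * frame_ms / 1000, idx * frame_ms / 1000))
            start = None
    if start is not None and len(voiced) - start >= min_frames:
        segments.append((start * frame_ms / 1000, len(voiced) * frame_ms / 1000))
    return segments


# Usage: silence, then a loud span, then silence again yields one segment,
# which an ASR system would then transcribe into a candidate text segment.
audio = [0.0] * 90 + [0.5] * 90 + [0.0] * 90
segs = vad_segments(audio, sr=1000)
```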

xiayu-cell commented 1 month ago

Thank you!