collabora / WhisperFusion

WhisperFusion builds upon the capabilities of WhisperLive and WhisperSpeech to provide seamless conversations with an AI.

WhisperLive `segments` only transcribe last 25 seconds #26

Closed. DamianB-BitFlipper closed this issue 5 months ago

DamianB-BitFlipper commented 5 months ago

Code reference: https://github.com/collabora/WhisperFusion/blob/main/whisper_live/trt_server.py#L376-L389

The highlighted code does not make logical sense to me and seems buggy. The main problem is that `segments` contains only `last_segment`, and `last_segment` covers only the clipped audio of the last 25 seconds. If someone talks for longer than 25 seconds with no EOS in between, the demo UI would not record the earliest part of the conversation. The purpose of line 380 is also unclear to me.
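To make the concern concrete, here is a minimal sketch of the clipping behaviour as I understand it. This is not the actual `trt_server.py` code: the helper `clip_to_last_window` and the 16 kHz sample rate are assumptions made only for illustration.

```python
SAMPLE_RATE = 16_000   # assumed sample rate in Hz, not taken from the repository
WINDOW_SECONDS = 25    # the window length discussed in this issue


def clip_to_last_window(audio, sample_rate=SAMPLE_RATE, window_seconds=WINDOW_SECONDS):
    """Hypothetical helper: keep only the trailing `window_seconds` of audio."""
    max_samples = sample_rate * window_seconds
    # Negative slicing returns the whole list when it is shorter than the window.
    return audio[-max_samples:]


# 40 seconds of continuous speech with no EOS in between.
speech = [0.0] * (40 * SAMPLE_RATE)
clipped = clip_to_last_window(speech)

kept_seconds = len(clipped) / SAMPLE_RATE
lost_seconds = len(speech) / SAMPLE_RATE - kept_seconds
print(f"kept {kept_seconds:.0f}s, lost the first {lost_seconds:.0f}s")
```

If the transcription is produced from `clipped` alone, anything said in the first 15 seconds never reaches the UI.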

I am trying to understand how this works with the UI. The server.py code in the most recent WhisperLive makes much more sense.

makaveli10 commented 5 months ago

Yes, `segments` is only the `last_segment`. We didn't remove the whisper_live logic of having segments because at some point we plan to introduce timestamps in the TensorRT backend, at which point we will have more than one segment. For now, TensorRT doesn't output the timestamp tokens; that's where the issue is.
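As a rough illustration of the difference (a sketch under assumptions, not the code in the repository), the hypothetical `build_segments` helper below returns a single segment when no timestamps are available and several once they are. The dictionary keys `start`, `end`, and `text` are illustrative choices, not a confirmed schema.

```python
def build_segments(text, timed_pieces=None):
    """Hypothetical helper: build the list of segments sent to the client.

    Without timestamps the whole transcription is one segment. With
    (start, end, text) tuples, each piece becomes its own segment.
    """
    if not timed_pieces:
        # Current situation described above: only the last segment exists.
        return [{"text": text}]
    return [
        {"start": start, "end": end, "text": piece}
        for start, end, piece in timed_pieces
    ]


single = build_segments("hello there how are you")
multi = build_segments("", [(0.0, 1.2, "hello there"), (1.2, 2.5, "how are you")])
print(len(single), len(multi))
```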

That said, we have to clean this up, but it was good to start with. And yes, you're right about the last 25 seconds as well (because we don't have timestamps from TensorRT). Currently, we support short exchanges; we will support interruptions and long exchanges in the conversation in upcoming updates. Thanks for your interest in the project.

DamianB-BitFlipper commented 5 months ago

Ok!