daily-demos / llm-talk

Talk to GPT-4 and create a story together.

client-side conversation flow logic #12

Closed: kylemcdonald closed this issue 10 months ago

kylemcdonald commented 11 months ago

Hi! The writeup on this app says, "On the client side, we're using JavaScript audio APIs to monitor the input level of the microphone to determine when the user has started and stopped talking." But I looked through the codebase and can't find anything like that.

I also checked whether the speaker object is evaluated for audio levels, but that doesn't seem to be the case either.

As far as I can tell, there are two conditions that trigger an endpoint (sketched in code below):

  1. `re.search(r'[\.\!\?]$', self.transcription)` matches, i.e. the transcription ends in sentence-final punctuation.
  2. More than 5 seconds have passed since the last transcription fragment.

Can you confirm that I'm not missing anything? If there is another demo somewhere that shows an example of monitoring input level, that would be very helpful. Thanks.
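
To make that concrete, here is a minimal sketch of those two conditions as a single endpoint check; the function shape, the `last_fragment_time` bookkeeping, and the timeout constant are my own framing rather than the repo's actual code:

```python
import re
import time

FRAGMENT_TIMEOUT_S = 5.0  # assumed: seconds of no new fragments before endpointing

def should_endpoint(transcription: str, last_fragment_time: float) -> bool:
    # Condition 1: the transcription ends in sentence-final punctuation.
    if re.search(r'[\.\!\?]$', transcription):
        return True
    # Condition 2: it has been 5 seconds since the last fragment arrived.
    if time.monotonic() - last_fragment_time >= FRAGMENT_TIMEOUT_S:
        return True
    return False
```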

kylemcdonald commented 10 months ago

For anyone else looking into this, I ended up implementing silero-vad on the backend. It's a little tricky to get everything right, because the timing of Deepgram results and Silero can fall out of sync due to various buffering, resampling, and latency effects. It's also not ideal to have a whole neural network running just to get reliable endpointing. But I realize this is a complex problem with domain-dependent solutions; maybe once https://github.com/daily-co/daily-python/issues/11 is closed it will be slightly easier.
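
As a rough illustration of that approach, here is a minimal streaming sketch built on the public silero-vad torch.hub interface; the chunk size, speech threshold, and silence window are illustrative assumptions showing the bookkeeping, not the exact values or structure used in this project:

```python
import torch

# Load the pretrained Silero VAD model via torch.hub (downloads on first run).
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')

SAMPLE_RATE = 16000
CHUNK_SAMPLES = 512        # assumed: Silero expects 512-sample chunks at 16 kHz
SPEECH_THRESHOLD = 0.5     # assumed: probability above which a chunk counts as speech
ENDPOINT_SILENCE_S = 0.8   # assumed: trailing silence that ends an utterance

class Endpointer:
    def __init__(self):
        self.speaking = False
        self.silence_s = 0.0

    def process(self, chunk: torch.Tensor) -> bool:
        """Feed one mono float32 chunk; return True when an endpoint is detected."""
        prob = model(chunk, SAMPLE_RATE).item()
        if prob >= SPEECH_THRESHOLD:
            # Speech resets the silence counter.
            self.speaking = True
            self.silence_s = 0.0
            return False
        if self.speaking:
            # Accumulate silence only after speech has started.
            self.silence_s += CHUNK_SAMPLES / SAMPLE_RATE
            if self.silence_s >= ENDPOINT_SILENCE_S:
                self.speaking = False
                self.silence_s = 0.0
                return True
        return False
```

The hard part mentioned above remains: aligning these VAD endpoints with Deepgram's transcription results, since each pipeline buffers and resamples on its own schedule.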