MahmoudAshraf97 / whisper-diarization

Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper
BSD 2-Clause "Simplified" License

AssertionError: chunk size too large, text got clipped #169

Closed by Reinmor 2 months ago

Reinmor commented 4 months ago

Hello.

I get the following error when processing a 95 MB MP3 audio file:

AssertionError                            Traceback (most recent call last)
<ipython-input-10-a13f7098883a> in <cell line: 1>()
      5     words_list = list(map(lambda x: x["word"], wsm))
      6 
----> 7     labled_words = punct_model.predict(words_list)
      8 
      9     ending_puncts = ".?!"

/usr/local/lib/python3.10/dist-packages/deepmultilingualpunctuation/punctuationmodel.py in predict(self, words)
     47             text = " ".join(batch)
     48             result = self.pipe(text)
---> 49             assert len(text) == result[-1]["end"], "chunk size too large, text got clipped"
     50 
     51             char_index = 0

AssertionError: chunk size too large, text got clipped

Can you tell me how to solve it?

Reinmor commented 4 months ago

I saw that this issue has already been raised in https://github.com/MahmoudAshraf97/whisper-diarization/issues/74. I will try the solution suggested there.

Reinmor commented 2 months ago

The solution works.
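
For readers who hit the same assertion: the traceback shows that `predict` joins a batch of words with spaces and then checks that the end offset of the last pipeline result equals `len(text)`. The check fails when the joined text is longer than what the pipeline returns results for. The sketch below is one possible way to keep each joined text short, by grouping words under a character budget before calling `predict`. It is not necessarily the fix discussed in issue #74. The helper name `batch_words` and the `max_chars` value are hypothetical, and a budget that avoids clipping would need to be found by experiment.

```python
from typing import Iterable, Iterator, List


def batch_words(words: Iterable[str], max_chars: int = 1000) -> Iterator[List[str]]:
    """Yield lists of words whose space-joined length stays within max_chars.

    Hypothetical helper, not part of deepmultilingualpunctuation.
    A single word longer than max_chars is still emitted, alone in its batch.
    """
    batch: List[str] = []
    length = 0  # length of " ".join(batch) so far
    for word in words:
        # A joining space is only needed when the batch already has a word.
        added = len(word) + (1 if batch else 0)
        if batch and length + added > max_chars:
            yield batch
            batch, length = [], 0
            added = len(word)
        batch.append(word)
        length += added
    if batch:
        yield batch


# Small demonstration with an 11-character budget.
demo = list(batch_words(["hello", "world", "this", "is", "a", "test"], max_chars=11))
print(demo)

# Assumed usage with the objects from the traceback (not run here):
#   labled_words = []
#   for batch in batch_words(words_list):
#       labled_words.extend(punct_model.predict(batch))
```

Splitting this way means the punctuation model sees less context at batch boundaries, so sentence breaks near those boundaries may be less reliable than with a single call.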