MahmoudAshraf97 / whisper-diarization

Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper
BSD 2-Clause "Simplified" License

List index out of range error for diarize_parallel.py #51

Closed kronkinatorix closed 1 year ago

kronkinatorix commented 1 year ago
/home/.........../token_classification.py:168: UserWarning: `grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="none"` instead.
  warnings.warn(
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /home/....../whisper-diarization/venv/diarize_parallel.py:137 in         │
│ <module>                                                                     │
│                                                                              │
│   134 │                                                                      │
│   135 │   words_list = list(map(lambda x: x["word"], wsm))                   │
│   136 │                                                                      │
│ ❱ 137 │   labled_words = punct_model.predict(words_list)                     │
│   138 │                                                                      │
│   139 │   ending_puncts = ".?!"                                              │
│   140 │   model_puncts = ".,?-:"                                             │
│                                                                              │
│ /home/.....//whisper-diarization/venv/ython3.10/site-packages/deepmultilingualpunctuation/pun │
│ ctuationmodel.py:39 in predict                                               │
│                                                                              │
│   36 │   │                                                                   │
│   37 │   │   # if the last batch is smaller than the overlap,                │
│   38 │   │   # we can just remove it                                         │
│ ❱ 39 │   │   if len(batches[-1]) <= overlap:                                 │
│   40 │   │   │   batches.pop()                                               │
│   41 │   │                                                                   │
│   42 │   │   tagged_words = []                                               │
╰──────────────────────────────────────────────────────────────────────────────╯
IndexError: list index out of range

I tried wrapping the len(batches[-1]) <= overlap check in punctuationmodel.py in a try block, and also added a len(batches[0]) <= overlap check to boot (I'm not great at this programming thing and still learning). With that change I was able to generate srt files / transcriptions for a couple of the audio files I was working with, but then the error came back.

Hope this is helpful!

MahmoudAshraf97 commented 1 year ago

Hello, this problem seems to be in the punctuation model's code, which I don't have access to or maintain. It's better to open this issue on their repo.
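For anyone hitting the same traceback: `batches[-1]` raises `IndexError` only when `batches` is empty, which would happen if the list of words passed to `predict` is empty (for example, when no speech was transcribed). That cause is an assumption, not something confirmed in this thread. Below is a minimal sketch of a guarded version of this kind of overlap batching, plus a caller-side check. The names `make_batches` and `safe_predict` and the default `chunk_size` and `overlap` values are hypothetical and are not part of deepmultilingualpunctuation.

```python
from typing import Callable, List


def make_batches(words: List[str], chunk_size: int = 230, overlap: int = 5) -> List[List[str]]:
    """Split words into overlapping chunks; returns [] for empty input.

    chunk_size and overlap defaults are illustrative assumptions.
    """
    if len(words) <= chunk_size:
        overlap = 0  # everything fits in one chunk, so no overlap is needed

    batches: List[List[str]] = []
    i = 0
    while i < len(words):
        batches.append(words[i:i + chunk_size])
        i += chunk_size - overlap

    # Guard: only inspect the last batch if at least one batch exists.
    # Without the `batches and` check, empty input raises IndexError here.
    if batches and len(batches[-1]) <= overlap:
        batches.pop()
    return batches


def safe_predict(predict_fn: Callable[[List[str]], list], words: List[str]) -> list:
    """Caller-side guard: skip the punctuation step when there are no words."""
    if not words:
        return []
    return predict_fn(words)
```

In diarize_parallel.py, the caller-side guard would amount to checking that `words_list` is non-empty before calling `punct_model.predict(words_list)`, which avoids editing the installed package at all.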