MahmoudAshraf97 / whisper-diarization

Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper
BSD 2-Clause "Simplified" License

Chunk size too large #74

Closed lgq-liao closed 1 year ago

lgq-liao commented 1 year ago

Hello @MahmoudAshraf97

Thanks for sharing this awesome project. I tried your solution and everything looks great, except for a "chunk size too large" error I encountered when using the test sample icsi_corpus.wav.

When I ran your original script with the test sample above, the first error was encountered at:

result_aligned = whisperx.align( whisper_results, alignment_model, metadata, vocal_target, self.device )

Failed to align segment (" Well, but you're talking about one per frame."): backtrack failed, resorting to original...
Failed to align segment (" Right."): backtrack failed, resorting to original...
Failed to align segment (" Or she."): backtrack failed, resorting to original...
Failed to align segment (" 034450567."): no characters in this segment found in model dictionary, resorting to original...

The second time, the script broke and exited at:

labled_words = punct_model.predict(words_list)

File "diarize.py", line 137, in <module>
    labled_words = punct_model.predict(words_list)
  File "/home/coe4-ws/anaconda3/lib/python3.8/site-packages/deepmultilingualpunctuation/punctuationmodel.py", line 49, in predict
    assert len(text) == result[-1]["end"], "chunk size too large, text got clipped"
AssertionError: chunk size too large, text got clipped

For the full log, please refer to the chunk_size_too_large.txt

MahmoudAshraf97 commented 1 year ago

Hello, the first error is just a warning and is expected in some cases. If the second error happens with all files, not just this one, let me know.

lgq-liao commented 1 year ago

@MahmoudAshraf97 I tried to isolate the issue to see whether it is specific to the input audio. I randomly downloaded 3 more test samples (Bed002-004) from the ICSI corpus; only Bed004 passed the script.

  1. Running the script with the Bed002 sample hit an out-of-memory error at

msdd_model.diarize()

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.14 GiB 
(GPU 0; 7.79 GiB total capacity; 6.78 GiB already allocated; 689.44 MiB free; 6.80 GiB reserved in total by PyTorch) 
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
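Following the hint in the error message itself, one mitigation worth trying (a sketch, not a verified fix for this repo) is setting `max_split_size_mb` via the `PYTORCH_CUDA_ALLOC_CONF` environment variable before the first CUDA allocation, i.e. before `torch` is imported:

```python
import os

# Hedged sketch based on the error message's own suggestion: cap the
# allocator's split size to reduce fragmentation. Must be set before the
# first CUDA allocation (in practice, before importing torch).
# The value 128 is an arbitrary example, not a recommendation from this thread.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```

This only helps when reserved memory far exceeds allocated memory, as the message notes; on a 7.79 GiB GPU, a long meeting recording may simply not fit.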

Any idea about the memory usage?

  2. Running the script with the Bed003 sample failed at

result_aligned = whisperx.align()

    result_aligned = whisperx.align(
  File "/home/coe4-ws/anaconda3/lib/python3.8/site-packages/whisperx/alignment.py", line 302, in align
    char_segments_arr = per_seg_grp.apply(lambda x: x.reset_index(drop = True)).reset_index()
  File "/home/coe4-ws/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py", line 6209, in reset_index
    new_obj.insert(
  File "/home/coe4-ws/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py", line 4772, in insert
    raise ValueError(f"cannot insert {column}, already exists")
ValueError: cannot insert subsegment-idx, already exists
cateyelow commented 1 year ago

The issue was solved by decreasing chunk_size in /home/user/.local/lib/python3.10/site-packages/deepmultilingualpunctuation/punctuationmodel.py, around line 29:

    def predict(self,words):
        overlap = 5
        chunk_size = 100 # original value 230
        if len(words) <= chunk_size:
            overlap = 0

If you still encounter the error after decreasing the value, lower chunk_size further (100 -> 80 -> 60).
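For intuition about why lowering the value helps, the chunking that `predict` performs can be sketched roughly like this (a standalone illustration with the lowered `chunk_size`, not the library's actual code):

```python
def chunk_words(words, chunk_size=100, overlap=5):
    """Split a word list into overlapping chunks, roughly mirroring the
    chunk_size/overlap logic quoted above (illustrative sketch only)."""
    if len(words) <= chunk_size:
        return [words]  # small inputs need no overlap, matching the quoted code
    chunks = []
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the final chunk reaches the end of the word list
    return chunks

words = [f"w{i}" for i in range(250)]
chunks = chunk_words(words)
```

Smaller chunks keep each model input under the tokenizer's limit, which is what the "text got clipped" assertion guards against.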

lgq-liao commented 1 year ago

@cateyelow, yes, it works, thanks for the fix!

jhdeov commented 4 months ago

I got this error when running the "Realligning Speech segments using Punctuation" section of the Colab notebook. How do I change the chunk size there?
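One possible approach for a Colab runtime (an assumption, since notebook layouts vary): apply the same `chunk_size` edit programmatically to the installed `punctuationmodel.py`, whose location can be found via `deepmultilingualpunctuation.__file__`. The sketch below demonstrates the rewrite on a stand-in file so it is self-contained; on Colab, `target` would point at the real module file.

```python
import pathlib

# Stand-in for the installed punctuationmodel.py. On Colab, locate the real
# file with:  import deepmultilingualpunctuation
#             print(deepmultilingualpunctuation.__file__)
target = pathlib.Path("/tmp/punctuationmodel_excerpt.py")
target.write_text("        overlap = 5\n        chunk_size = 230\n")

# Lower chunk_size in place (230 -> 100), the same edit described earlier
# in this thread.
target.write_text(target.read_text().replace("chunk_size = 230",
                                             "chunk_size = 100"))
```

After editing, restart the runtime (or reload the module) so the new value takes effect.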