Blair-Johnson / batch-whisper

Batch Support for OpenAI Whisper
MIT License

Shape mismatch in certain batches #9

Open sbuser opened 1 year ago

sbuser commented 1 year ago

I've tried for a while to figure out what is causing this, without much success. Batch processing runs fine for a variety of files, but I've come upon a group that throws an IndexError on the 2nd segments of the batch:

File "/app/.venv/lib/python3.9/site-packages/whisper/decoding.py", line 694, in _main_loop
    probs_at_sot.append(logits[:, self.sot_index[i]].float().softmax(dim=-1))
IndexError: index 8 is out of bounds for dimension 1 with size 3

In a normal run, self.sot_index is the same at all indices: [8, 8, 8, 8, 8, 8] or [11, 11, 11, 11, 11, 11].

In the batch and segment that fail, it looks like this:

[0, 0, 0, 0, 8, 0]   <-- self.sot_index
0 0 torch.Size([6, 3, 51865])  <-- i, self.sot_index[i], logits.shape
1 0 torch.Size([6, 3, 51865])
2 0 torch.Size([6, 3, 51865])
3 0 torch.Size([6, 3, 51865])
4 8 torch.Size([6, 3, 51865])

I'm not tracking how this is happening. I'm not providing different languages or an initial prompt, so I don't understand the mismatch in sot_index here.

I do see in the output that it hasn't properly transcribed portions of that file from the first segment onward. I don't see where it would be holding onto that state in a way that causes this problem, but something is broken.

Sorry I'm not of more help on this. I'll keep digging.

Blair-Johnson commented 1 year ago

Interesting, I'm happy to help you with debugging this. Do the clips transcribe properly on the official whisper implementation?

sbuser commented 1 year ago

They do, yes. Interestingly, the official implementation not only avoids the IndexError, it also does a better job with the transcription itself. Perhaps that's related to the temperature cascading discussed in the other issue? I'm not sure.

Without changing anything except adding print statements to diagnose this issue, on maybe the 10th run it actually passed the step where it had previously failed (no IndexError) and put a bunch of garbage in that segment's transcription. I suppose nothing guarantees the outputs here are deterministic, but that surprised me.

While trying to answer this, I also found that the fix for no_speech_prob returning an array of all of the probabilities breaks running whisper against a single audio file (i.e. when it bypasses all of the batch code).

Edit, to clarify the non-deterministic behavior: that was probably related to the other files in the batch changing between runs. I'm batching by file size, and there were quite a few files with exactly the same size, so that likely accounts for the differences between runs rather than the model itself. If so, then it's pretty clear the temperature linking can have a negative effect. Batching certainly has some effect on outcomes, because the file is fully and properly transcribed when run by itself.

JunZhan2000 commented 1 year ago

I ran into this too