bakermanbrian opened this issue:
same question
Anecdotally, I would say the accuracy is worse. The Hugging Face implementation, which this project uses, applies a stride to the chunks, meaning some of the input audio is duplicated across multiple chunks. This is done to add context and to compensate for the fact that the batched implementation cannot see the output of the previous chunk, since the chunks are transcribed in parallel.
The issue is that the overlapping chunks often cause the same text to be emitted multiple times. In theory this is corrected by merging heuristics, but in practice I have not found them to work well.
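Roughly, the batched path looks like this. A minimal sketch using the `transformers` ASR pipeline; the model name, stride values, and batch size here are illustrative, not necessarily this repo's defaults:

```python
import torch
from transformers import pipeline

# Minimal sketch of the chunked/batched path; model name, stride
# values, and batch size are illustrative choices.
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    torch_dtype=torch.float16,
    device="cuda:0",
)

# chunk_length_s cuts the audio into 30 s windows; stride_length_s
# feeds (left, right) seconds of neighbouring audio into each chunk.
# That overlap is exactly what can surface as repeated text.
out = pipe(
    "audio.mp3",
    chunk_length_s=30,
    stride_length_s=(5, 5),
    batch_size=8,
    return_timestamps=True,
)
print(out["text"])
```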
Also, the Hugging Face implementation does not apply the hallucination checks to chunks shorter than 30 seconds, and with batched inference every chunk is 30 seconds or less.
Fundamentally, I don't see how a parallelized version of Whisper can achieve the same accuracy as the original serial one, since it lacks the context from previous chunks, which often helps resolve ambiguity.
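For contrast, the original serial implementation conditions each window on the previous text and applies its hallucination heuristics to every segment. A minimal sketch using the openai-whisper package, with the library's default threshold values written out for illustration:

```python
import whisper

model = whisper.load_model("large-v2")

# The serial decoder carries the previous window's text forward and
# retries a segment at higher temperatures when it trips these checks.
# Values shown are the library defaults, written out for illustration.
result = model.transcribe(
    "audio.mp3",
    condition_on_previous_text=True,  # use prior output as context
    compression_ratio_threshold=2.4,  # flags repetitive, compressible output
    logprob_threshold=-1.0,           # flags low-confidence segments
    no_speech_threshold=0.6,          # suppresses likely-silent segments
)
print(result["text"])
```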
Haven't found anything for Whisper hallucinations so far, and ended up writing a simple post-processing workaround for repeated chunks/hallucinations:
import sys

def find_identical_rows(file_path, N):
    # Record the line numbers at which each distinct line occurs
    line_positions = {}
    with open(file_path, 'r') as file:
        for line_number, line in enumerate(file, start=1):
            # Remove any leading/trailing whitespace characters
            line = line.strip()
            if not line:  # Skip empty rows
                continue
            line_positions.setdefault(line, []).append(line_number)

    # Keep lines that occur more than once within N lines of a previous occurrence
    identical_rows = []
    for line, positions in line_positions.items():
        if len(positions) > 1:
            for i in range(len(positions) - 1):
                if positions[i + 1] - positions[i] <= N:
                    identical_rows.append((line, positions))
                    break

    # Print the lines that meet the criteria
    if identical_rows:
        print(file_path)
        for row, positions in identical_rows:
            print(f"(Positions: {positions}) {row}")

input_file = sys.argv[1]  # Input file path from the command line
find_identical_rows(input_file, 12)
$ python identical-rows.py 01-large-v2.srt
01-large-v2.srt
(Positions: [91, 99, 383, 1459, 4087, 4887]) Yeah.
(Positions: [363, 375]) Hello.
(Positions: [1895, 1899, 1903, 1907, 1911, 1915, 1919, 1923, 1927, 1931, 1935, 1939, 1943, 1947, 1951, 1955, 1959, 1963, 1967, 2899, 2903, 2907, 2911, 2915, 2919, 2923, 2927, 2931, 2935, 2939, 2943, 2947, 2951, 2955, 2959, 2963, 2967, 2971, 3599, 3603, 3607, 3611, 3615, 3619, 3623, 3627, 3631, 3635, 3639, 3643, 3647, 3651, 3655, 3659, 3663, 3667, 3671, 3763, 3767, 3771, 3775, 3779, 3783, 3787, 3791, 3795, 3799, 3803, 3807, 3811, 3815, 3819, 3823, 3827, 3831, 4039, 4307, 4311, 4315, 4319, 4323, 4327, 4331, 4335, 4339, 4343, 4347, 4351, 4355, 4359, 4363, 4367, 4371, 4375, 4379, 4399, 4403, 4407, 4411, 4415, 4419, 4423, 4427, 4431, 4435, 4439, 4443, 4447, 4451, 4455, 4459, 4463, 4467, 4471, 4479, 4483, 4487, 4491, 4495, 4499, 4503, 4507, 4511, 4515, 4519, 4523, 4527, 4531, 4535, 4539, 4543, 4547, 4551, 5159]) Okay.
(Positions: [3507, 3583, 3587, 4571, 4575, 4579]) Silence.
I am using Faster Whisper, whose accuracy is supposed to match the OpenAI model while using much less memory. How does Insanely Fast Whisper compare on both of those fronts?
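For context, this is roughly how I invoke it. A sketch only; the model size and `compute_type` are illustrative, and the quantized compute type is where the memory savings come from:

```python
from faster_whisper import WhisperModel

# Sketch only; model size and compute_type are illustrative. The
# quantized compute types (int8/float16) drive the lower memory use.
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.mp3", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```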