Closed: McCloudS closed this 10 months ago
Haven't looked at the API in detail - but the host address looks wrong? host='192.168.111'
Hah, yeah, thanks, I ended up finding that this morning after looking at it with fresh eyes. I'm moving to FastAPI to mimic the whisper-asr container, which should make it a semi drop-in replacement.
Just running into an issue where whisper doesn't like the SpooledTemporaryFile (audio_file) that comes in from Bazarr. I'm not sure how whisper-asr-webservice is able to pass it directly into the transcribe function.
@app.post("/asr")
async def asr(
    task: Union[str, None] = Query(default="transcribe", enum=["transcribe", "translate"]),
    language: Union[str, None] = Query(default=None),
    initial_prompt: Union[str, None] = Query(default=None),
    audio_file: UploadFile = File(...),
    encode: bool = Query(default=True, description="Encode audio first through ffmpeg"),
    output: Union[str, None] = Query(default="srt", enum=["txt", "vtt", "srt", "tsv", "json"]),
    word_timestamps: bool = Query(default=False, description="Word level timestamps")
):
    logging.debug("This hook is from Bazarr/ASR webhook!")
    global model
    try:
        print(f"Transcribing file from Bazarr")
        start_time = time.time()
        if model is None:
            logging.debug("Model was purged, need to re-create")
            model = stable_whisper.load_faster_whisper(whisper_model, download_root=model_location, device=transcribe_device, cpu_threads=whisper_threads, num_workers=concurrent_transcriptions)
        tempfile = io.BytesIO(open(audio_file.file, "rb").read())
        result = model.transcribe_stable(audio_file.read(), task=transcribe_or_translate)
        result.to_srt_vtt(f"/tmp/{audio_file.filename}.{output}", word_level=word_level_highlight)
        elapsed_time = time.time() - start_time
        minutes, seconds = divmod(int(elapsed_time), 60)
        print(f" {audio_file.filename} is completed, it took {minutes} minutes and {seconds} seconds to complete.")
    except Exception as e:
        print(f"Error processing or transcribing {audio_file.filename}: {e}")
    finally:
        if len(files_to_transcribe) == 0:
            logging.debug("Queue is empty, clearing/releasing VRAM")
            del model
            gc.collect()
    return StreamingResponse(
        io.BytesIO(open(f"/tmp/{audio_file.filename}.{output}", "rb").read()),
        media_type="text/plain",
        headers={
            #'Asr-Engine': ASR_ENGINE,
            'Content-Disposition': f'attachment; filename="/tmp/{audio_file.filename}.{output}"'
        })
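One small thing worth flagging in the handler above (an observation about the snippet, not about what's in the repo): in an async route, UploadFile.read() is a coroutine, so calling audio_file.read() without await hands the transcriber a coroutine object rather than bytes. A minimal self-contained sketch of the two ways to actually pull the payload out of the upload (the route name here is made up, just for illustration):

from fastapi import FastAPI, File, UploadFile

app = FastAPI()

@app.post("/upload-bytes-demo")  # hypothetical route, only to illustrate the two access patterns
async def upload_bytes_demo(audio_file: UploadFile = File(...)):
    # UploadFile.read() is async; without `await` you get a coroutine, not the payload.
    async_bytes = await audio_file.read()
    # Equivalent synchronous access through the underlying SpooledTemporaryFile.
    audio_file.file.seek(0)
    sync_bytes = audio_file.file.read()
    return {"received_bytes": len(async_bytes), "matches_sync_read": async_bytes == sync_bytes}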
It looks like whisper-asr does some pre-packaging/processing for audio:
def load_audio(file: BinaryIO, encode=True, sr: int = SAMPLE_RATE):
    """
    Open an audio file object and read as mono waveform, resampling as necessary.
    Modified from https://github.com/openai/whisper/blob/main/whisper/audio.py to accept a file object

    Parameters
    ----------
    file: BinaryIO
        The audio file like object
    encode: Boolean
        If true, encode audio stream to WAV before sending to whisper
    sr: int
        The sample rate to resample the audio if necessary

    Returns
    -------
    A NumPy array containing the audio waveform, in float32 dtype.
    """
and then passes the resulting NumPy array around:
result = transcribe(load_audio(audio_file.file, encode), task, language, initial_prompt, word_timestamps, output)
Whereas you seem to be passing the file path and stable-ts does the work. Not sure if that makes any difference!
Thanks. Even replicating load_audio, I get similar issues. What's puzzling to me is that faster-whisper (and stable-ts) use PyAV to ingest just about anything and resample as necessary. Even when I attempt to convert audio_file to bytes (which stable-ts and faster-whisper will accept as an argument), I still get odd errors.
@app.post("/asr")
async def asr(
    task: Union[str, None] = Query(default="transcribe", enum=["transcribe", "translate"]),
    language: Union[str, None] = Query(default=None),
    initial_prompt: Union[str, None] = Query(default=None),
    audio_file: UploadFile = File(...),
    encode: bool = Query(default=True, description="Encode audio first through ffmpeg"),
    output: Union[str, None] = Query(default="srt", enum=["txt", "vtt", "srt", "tsv", "json"]),
    word_timestamps: bool = Query(default=False, description="Word level timestamps")
):
    logging.debug("This hook is from Bazarr/ASR webhook!")
    try:
        print(f"Transcribing file from Bazarr")
        start_time = time.time()
        logging.debug("Model was purged, need to re-create")
        model = stable_whisper.load_faster_whisper(whisper_model, download_root=model_location, device=transcribe_device, cpu_threads=whisper_threads, num_workers=concurrent_transcriptions)
        print(f"File type is: {type(audio_file)}")
        print(f"File name is: {audio_file.filename}")
        print("File content type is: {audio_file.content_type}")
        import shutil
        audio_file.file.seek(0)
        with open("./audio_file.wav", "wb") as new_file:
            shutil.copyfileobj(load_audio(audio_file.file, encode), new_file)
        audio_file.file.seek(0)
        result = model.transcribe_stable(load_audio(audio_file.file, encode), task=transcribe_or_translate)
        result.to_srt_vtt(f"/tmp/{audio_file.filename}.{output}", word_level=word_level_highlight)
        elapsed_time = time.time() - start_time
        minutes, seconds = divmod(int(elapsed_time), 60)
        print(f" {audio_file.filename} is completed, it took {minutes} minutes and {seconds} seconds to complete.")
    except Exception as e:
        print(f"Error processing or transcribing {audio_file.filename}: {e}")
    return StreamingResponse(
        io.BytesIO(open(f"/tmp/{audio_file.filename}.{output}", "rb").read()),
        media_type="text/plain",
        headers={
            #'Asr-Engine': ASR_ENGINE,
            'Content-Disposition': f'attachment; filename="/tmp/{audio_file.filename}.{output}"'
        })
def load_audio(file: BinaryIO, encode=True, sr: int = 16000):
    """
    Open an audio file object and read as mono waveform, resampling as necessary.
    Modified from https://github.com/openai/whisper/blob/main/whisper/audio.py to accept a file object

    Parameters
    ----------
    file: BinaryIO
        The audio file like object
    encode: Boolean
        If true, encode audio stream to WAV before sending to whisper
    sr: int
        The sample rate to resample the audio if necessary

    Returns
    -------
    A NumPy array containing the audio waveform, in float32 dtype.
    """
    if encode:
        try:
            # This launches a subprocess to decode audio while down-mixing and resampling as necessary.
            # Requires the ffmpeg CLI and `ffmpeg-python` package to be installed.
            out, _ = (
                ffmpeg.input("pipe:", threads=0)
                .output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=sr)
                .run(cmd="ffmpeg", capture_stdout=True, capture_stderr=True, input=file.read())
            )
        except ffmpeg.Error as e:
            raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e
    else:
        out = file.read()

    return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0
File type is: <class 'starlette.datastructures.UploadFile'>
File name is: audio_file
File content type is: {audio_file.content_type}
Error processing or transcribing audio_file: [input_sr] is required when [audio] is a PyTorch tensor or NumPy array.
Or even trying:
print(f"File content type convert is: {type(audio_file.file.read())}")
result = model.transcribe_stable(audio_file.file.read(), task=transcribe_or_translate)
Gives:
File type is: <class 'starlette.datastructures.UploadFile'>
File name is: audio_file
File content type is: None
File content type convert is: <class 'bytes'>
Error processing or transcribing audio_file: [Errno 1094995529] Invalid data found when processing input: '<none>'
Getting peculiar data. When Bazarr sends the audio, it appears incomplete. When I ffprobe any file I get from it, it returns "Invalid data found when processing input". When I send my own audio via Postman or Python, it comes through fine and transcribes... I have it dumping the received file in the script folder to look at.
@app.post("/asr")
async def asr(
    task: Union[str, None] = Query(default="transcribe", enum=["transcribe", "translate"]),
    language: Union[str, None] = Query(default=None),
    initial_prompt: Union[str, None] = Query(default=None),
    audio_file: UploadFile = File(...),
    encode: bool = Query(default=True, description="Encode audio first through ffmpeg"),
    output: Union[str, None] = Query(default="srt", enum=["txt", "vtt", "srt", "tsv", "json"]),
    word_timestamps: bool = Query(default=False, description="Word level timestamps")
):
    logging.debug("This hook is from Bazarr/ASR webhook!")
    try:
        print(f"Transcribing file from Bazarr")
        start_time = time.time()
        logging.debug("Model was purged, need to re-create")
        model = stable_whisper.load_faster_whisper(whisper_model, download_root=model_location, device=transcribe_device, cpu_threads=whisper_threads, num_workers=concurrent_transcriptions)
        print(f"File type is: {type(audio_file)}")
        print(f"File name is: {audio_file.filename}")
        print(f"File content type is: {type(audio_file.content_type)}")
        print(f"File content type convert is: {type(audio_file.file.read())}")
        import shutil
        audio_file.file.seek(0)
        with open("./audio_file.wav", "wb") as new_file:
            shutil.copyfileobj(audio_file.file, new_file)
        audio_file.file.seek(0)
        result = model.transcribe_stable(audio_file.file.read(), task=transcribe_or_translate)
        result.to_srt_vtt(f"/tmp/{audio_file.filename}.{output}", word_level=word_level_highlight)
        elapsed_time = time.time() - start_time
        minutes, seconds = divmod(int(elapsed_time), 60)
        print(f" {audio_file.filename} is completed, it took {minutes} minutes and {seconds} seconds to complete.")
    except Exception as e:
        print(f"Error processing or transcribing {audio_file.filename}: {e}")
    return StreamingResponse(
        io.BytesIO(open(f"/tmp/{audio_file.filename}.{output}", "rb").read()),
        media_type="text/plain",
        headers={
            #'Asr-Engine': ASR_ENGINE,
            'Content-Disposition': f'attachment; filename="/tmp/{audio_file.filename}.{output}"'
        })
I'm at a standstill if anyone wants to take a crack at it. You'll have to import a couple more libraries to run what I have above, but it should be obvious once you run it.
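For whoever wants to poke at it, here is a quick check on the ./audio_file.wav dump the handler writes, to see whether what Bazarr posts even has a container header. This is only a sketch; the raw s16le/16 kHz guess is an assumption on my part based on Bazarr calling the endpoint with encode=false, not something I've confirmed:

import numpy as np

# Inspect the dump written by the handler above.
with open("./audio_file.wav", "rb") as f:
    data = f.read()

print(f"size: {len(data)} bytes, first 4 bytes: {data[:4]!r}")
if data[:4] == b"RIFF":
    print("Looks like a proper WAV container, so ffprobe should have been happy")
else:
    # Assumption: headerless little-endian 16-bit PCM, mono, 16 kHz (what load_audio emits),
    # which would explain ffprobe's "Invalid data found when processing input".
    samples = np.frombuffer(data[: len(data) // 2 * 2], np.int16)
    print(f"No RIFF header; treated as raw s16le this is ~{len(samples) / 16000:.1f}s of 16 kHz mono audio")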
After a bit of hair-pulling, I finally traced the error to stable-ts expecting NumPy arrays to come with the sample rate for the model: input_sr=16000. Should have a release up in a bit.
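For anyone following along, the shape of the fix looks roughly like this (a minimal standalone sketch, not the exact release code; the model size, the silent dummy waveform, and the output path are placeholders, and input_sr is the parameter named in the stable-ts error above):

import numpy as np
import stable_whisper

model = stable_whisper.load_faster_whisper("base")  # placeholder model size

# Stand-in for what load_audio() produces from Bazarr's upload: float32 mono PCM at 16 kHz.
waveform = np.zeros(16000, dtype=np.float32)  # one second of silence, just for the demo

# The key change: when the audio is a NumPy array, stable-ts needs the sample rate.
result = model.transcribe_stable(waveform, input_sr=16000, task="transcribe")
result.to_srt_vtt("/tmp/demo.srt", word_level=False)  # placeholder output path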
Done. See https://github.com/McCloudS/subgen/edit/main/README.md#bazarr
Tried a couple runs and it seems to work, let me know. I'll leave this open for comments.
Nice job. Works for me - apart from trying to melt my Synology ds920+ CPU!
Using the medium.en model, it took about 14 minutes to transcribe a 7-minute short film.
Thanks again.
I've attempted to replicate the existing capability from
https://wiki.bazarr.media/Additional-Configuration/Whisper-Provider/ which uses the code from
https://github.com/ahmetoner/whisper-asr-webservice/blob/51c6eceda0836d145048224693c69c2706d78f46/app/webservice.py#L61-L78
The code block called from Bazarr is
https://github.com/morpheus65535/bazarr/blob/a09cc34e09407b8a2338d1034de7f8ff8fc91b19/libs/subliminal_patch/providers/whisperai.py#L284-L295
The code block I came up with:
In the Bazarr logs I'm getting:
2023-10-24 22:22:49,787 - retry.api (14dee3942b38) : WARNING (api:40) - HTTPConnectionPool(host='192.168.111', port=8090): Max retries exceeded with url: /asr?task=transcribe&language=en&output=srt&encode=false (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x14dee0988610>, 'Connection to 192.168.111 timed out. (connect timeout=9000)')), retrying in 5 seconds...
I'm at a loss. I can't even get a webhook response from Bazarr when trying to manually generate a subtitle using Whisper as the provider. The machines can reach each other via ping. I'm sure I'm missing something simple.
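In case it helps, this is how the endpoint can be exercised without Bazarr in the loop, mirroring the request from the log above (the host address and sample file below are placeholders; the port, path, and query string match the log, and the multipart field name matches the audio_file parameter in the /asr handler):

import requests

SUBGEN_HOST = "192.168.1.100"  # placeholder: full four-octet address of the box running the FastAPI service
SAMPLE = "sample.wav"          # placeholder: any small audio file to post

params = {"task": "transcribe", "language": "en", "output": "srt", "encode": "false"}
with open(SAMPLE, "rb") as f:
    resp = requests.post(
        f"http://{SUBGEN_HOST}:8090/asr",
        params=params,
        files={"audio_file": f},  # same multipart field name the /asr handler declares
        timeout=120,
    )
print(resp.status_code)
print(resp.text[:500])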