fedirz / faster-whisper-server

https://hub.docker.com/r/fedirz/faster-whisper-server
MIT License
477 stars 78 forks

Word-level timestamps #29

Closed paulhargreaves closed 3 months ago

paulhargreaves commented 3 months ago

https://github.com/SYSTRAN/faster-whisper supports the option 'word_timestamps' - it would be useful to expose this for the json format. It's necessary for any accurate diarization pipeline.

fedirz commented 3 months ago

There's support for it. Search for timestamp_granularities in faster_whisper_server/main.py

paulhargreaves commented 2 months ago

Thank you. I've been trying to get it working, but the response always comes back in segments of 20 words or so.

data = {
    'language': 'en',
    'stream': 'True',
    'timestamp_granularities': 'word',
    'response_format': 'verbose_json',
    'model': 'Systran/faster-whisper-medium.en',
}
r = requests.post(url, data=data, files=files)

I've also tried 'timestamp_granularities': ['word'] and 'timestamp_granularities': '[word]'; all have the same effect.

I'm using docker latest:gpu, updated this morning.

fedirz commented 2 months ago

I think it needs to be timestamp_granularities[]=word. It's a slightly awkward API, but that's what's in the OpenAI spec, which I'm trying to follow.

❄ ❯ curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.wav" -F "model=Systran/faster-whisper-tiny.en" -F "timestamp_granularities[]=word" -F "response_format=verbose_json"
{"task":"transcribe","language":"en","duration":6.54,"text":"When you call someone who is thousands of miles away, you're using a satellite.","words":[{"start":0.82,"end":1.34,"word":" When","probability":0.7198526263237},{"start":1.34,"end":1.52,"word":" you","probability":0.9967663288116455},{"start":1.52,"end":1.78,"word":" call","probability":0.9907167553901672},{"start":1.78,"end":2.22,"word":" someone","probability":0.9944461584091187},{"start":2.22,"end":2.46,"word":" who","probability":0.9258458614349365},{"start":2.46,"end":2.64,"word":" is","probability":0.9671499729156494},{"start":2.64,"end":3.08,"word":" thousands","probability":0.8997396230697632},{"start":3.08,"end":3.32,"word":" of","probability":0.8525071144104004},{"start":3.32,"end":3.54,"word":" miles","probability":0.9981999397277832},{"start":3.54,"end":4.08,"word":" away,","probability":0.9990777969360352},{"start":4.4,"end":4.54,"word":" you're","probability":0.8400306403636932},{"start":4.54,"end":4.86,"word":" using","probability":0.9926112294197083},{"start":4.86,"end":5.22,"word":" a","probability":0.864382803440094},{"start":5.22,"end":5.42,"word":" satellite.","probability":0.9605082273483276}],"segments":[{"id":1,"seek":572,"start":0.82,"end":5.42,"text":" When you call someone who is thousands of miles away, you're using a satellite.","tokens":[50363,1649,345,869,2130,508,318,4138,286,4608,1497,11,345,821,1262,257,11210,13,50613],"temperature":0.0,"avg_logprob":-0.27834644466638564,"compression_ratio":1.0394736842105263,"no_speech_prob":0.035264752805233}]}
❄ ❯ curl http://localhost:8000/v1/audio/transcriptions -F "file=@audio.wav" -F "model=Systran/faster-whisper-tiny.en" -F "timestamp_granularities[]=segment" -F "response_format=verbose_json"
{"task":"transcribe","language":"en","duration":6.54,"text":"When you call someone who is thousands of miles away, you're using a satellite.","words":[],"segments":[{"id":1,"seek":572,"start":0.82,"end":5.82,"text":" When you call someone who is thousands of miles away, you're using a satellite.","tokens":[50363,1649,345,869,2130,508,318,4138,286,4608,1497,11,345,821,1262,257,11210,13,50613],"temperature":0.0,"avg_logprob":-0.27834644466638564,"compression_ratio":1.0394736842105263,"no_speech_prob":0.035264752805233}]}

Does this answer your question?
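The working curl call above can be translated to Python as well. A minimal sketch, assuming the same local server; it uses the standard library to show how the bracketed field name is encoded, with the actual upload via the `requests` library shown in a comment:

```python
from urllib.parse import urlencode

# Assumption: faster-whisper-server reachable at localhost:8000.
data = {
    "model": "Systran/faster-whisper-tiny.en",
    "response_format": "verbose_json",
    # Note the literal "[]" in the field name, matching the curl flag
    # -F "timestamp_granularities[]=word".
    "timestamp_granularities[]": "word",
}

# Actual upload would be something like:
#   requests.post(url, data=data, files={"file": open("audio.wav", "rb")})
print(urlencode(data))  # the brackets appear percent-encoded as %5B%5D on the wire
```

The key point is that `timestamp_granularities[]`, brackets included, is the form field name itself, not Python list syntax.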

paulhargreaves commented 2 months ago

Interesting, thank you. I'll give that a try shortly. Earlier I also tried the Swagger interface. I couldn't upload audio through it (a known Swagger bug), but I could see what the POST was expecting, and the curl it generated didn't have the [] either.

(EDIT:) That worked, thank you!

vmarchenkoff commented 2 months ago

Is it possible to use this with the OpenAI client syntax, rather than curl?

transcript = client_whisper.audio.transcriptions.create(
    model="medium",
    file=audio_file,
    language='en',
    timestamp_granularities='segment',
    response_format='verbose_json',
)

doesn't work...

Thanks for your work!

fedirz commented 2 months ago

@vmarchenkoff Yes. The reason it doesn't work is that OpenAI's client expects a list for timestamp_granularities, and you are passing in a string. Here's the usage example.
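A corrected sketch of the call above, with the fix isolated in one place. Assumptions: the OpenAI Python client (v1.x) is installed, the server runs on localhost:8000, and the API key value is ignored by faster-whisper-server:

```python
# The fix: timestamp_granularities must be a list, not a string.
request_kwargs = dict(
    model="Systran/faster-whisper-medium.en",
    language="en",
    response_format="verbose_json",
    timestamp_granularities=["word"],  # list, not 'word'
)

def transcribe(path="audio.wav"):
    # Imported here so the sketch can be read without the package installed.
    from openai import OpenAI
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
    with open(path, "rb") as f:
        return client.audio.transcriptions.create(file=f, **request_kwargs)
```

Passing `["word"]` (or `["segment"]`, or both) mirrors the repeated `timestamp_granularities[]` form fields that the curl examples earlier in the thread send.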