argmaxinc / WhisperKit

On-device Speech Recognition for Apple Silicon
https://takeargmax.com/blog/whisperkit
MIT License
3.17k stars 267 forks

Standard output while processing. #123

Open quist00 opened 5 months ago

quist00 commented 5 months ago

Both the original implementation and whisper.cpp dump progress to standard output by default. WhisperKit seems silent until the end, and the verbose flag seems to output much lower-level information. Assuming I just didn't read the documentation correctly, please consider dumping to standard output by default. Where there are substantial performance differences, this makes them easy to see just by testing against the same file, allowing a simple gestalt comparison of the different implementations, and of different models within the same implementation.

ZachNagengast commented 5 months ago

We do have different log levels; it sounds like you're interested in `logLevel: .info` rather than `.debug`? For the CLI this is hardcoded at the moment, so we can add it as a new CLI argument. Anything specific you'd especially like to see in the info logs?

atiorh commented 5 months ago

@quist00 Adding to Zach's point, if you are interested in a streaming application (as opposed to offline processing of a file) and want to test/emulate the streaming performance on a file, you can use `--stream-simulated` in the CLI.

quist00 commented 5 months ago

It would be great if that could be added as a flag to the CLI. Streaming applications are not something we are really looking at currently. I work at a library, and we want to use Whisper internally to drastically reduce the time and expense of transcribing/translating items for oral history projects. I and many of my colleagues have Apple Silicon, so I really appreciate you all working on options that run more efficiently for us. I want to share it with other researchers around campus who may also have dozens or hundreds of hours of audio to contend with, so the command line will really be the best option for most of them, rather than a programmatic API approach, given that they are mostly not programmers and don't have any on staff.

As far as the output, I think timestamps along with chunks of text as it goes would be best. That way, novice users can get a rough estimate: "if I use this model with WhisperKit, I can expect x minutes of output per minute of processing." They can then grade the output and determine the right tradeoff of model quality versus processing time.
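For illustration, the kind of incremental, timestamped output being requested here (similar to whisper.cpp's default console output) could look like the sketch below. This is a hypothetical example of the desired format, not part of WhisperKit or its CLI:

```python
# Hypothetical sketch of the requested progress output: print each
# transcribed segment with its timestamps as soon as it completes,
# in a whisper.cpp-style "[start --> end] text" format.

def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS.mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"

def print_segment(start: float, end: float, text: str) -> None:
    """Emit one segment line to standard output as it is produced."""
    print(f"[{format_timestamp(start)} --> {format_timestamp(end)}] {text}")

# Example: in practice, segments would arrive incrementally from the
# transcriber rather than from a pre-built list.
segments = [
    (0.0, 4.2, "Welcome to the oral history project."),
    (4.2, 9.8, "Today we are speaking with our first narrator."),
]
for start, end, text in segments:
    print_segment(start, end, text)
```

Seeing lines like `[00:00:00.000 --> 00:00:04.200] ...` scroll by gives a novice user a running sense of how far along the transcription is relative to the audio length.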

Thanks for your consideration.

ZachNagengast commented 5 months ago

@quist00 Could you perhaps give an example of the input/output pairs you're looking for? That way we can build toward a CLI flag that would result in an acceptable output for you.