argmaxinc / WhisperKit

On-device Speech Recognition for Apple Silicon
https://takeargmax.com/blog/whisperkit
MIT License
3.17k stars 268 forks

Want to use AVCaptureSession buffers instead of AVAudioEngine #44

Open cgfarmer4 opened 7 months ago

cgfarmer4 commented 7 months ago

Hey there!

First off, thanks so much for building this awesome library! It's a total pleasure to use and works great. Looking forward to the Metal update. In the meantime, I was curious whether you would accept a PR that allows AVCaptureSession to be used in the AudioProcessor class instead of AVAudioEngine.

I was thinking of adding a way to pass in a new setupEngine function that allows the captureOutput delegate to be used in place of the installTap function (rough sketch after the questions below). The reason I want to do this is that it makes it easier to change the microphone in-app instead of relying on the system default.

  1. Would it make sense to allow for this in the AudioProcessor? If so, I'm happy to come up with a clean interface proposal.
  2. If not, perhaps there's a way to override the AudioProcessor class and provide an alternate setupEngine function?
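To make the idea concrete, here is a rough sketch of the delegate approach. CaptureAudioSource, start(device:), and bufferCallback are placeholder names I made up, and the [Float] conversion is elided:

import AVFoundation

// Placeholder sketch: an AVCaptureSession-backed audio source that could feed
// WhisperKit-style [Float] samples via a callback, mirroring installTap's role.
final class CaptureAudioSource: NSObject, AVCaptureAudioDataOutputSampleBufferDelegate {
    private let session = AVCaptureSession()
    private let output = AVCaptureAudioDataOutput()
    private let queue = DispatchQueue(label: "capture.audio.queue")
    var bufferCallback: (([Float]) -> Void)?

    // Swapping the microphone is just a matter of passing a different device here.
    func start(device: AVCaptureDevice) throws {
        session.beginConfiguration()
        let input = try AVCaptureDeviceInput(device: device)
        if session.canAddInput(input) { session.addInput(input) }
        if session.canAddOutput(output) { session.addOutput(output) }
        output.setSampleBufferDelegate(self, queue: queue)
        session.commitConfiguration()
        session.startRunning()
    }

    // Delegate callback: the AVCaptureSession analogue of installTap's block.
    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        // ... convert sampleBuffer to [Float] and invoke bufferCallback ...
    }
}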
atiorh commented 7 months ago

Thanks for the note @cgfarmer4! @ZachNagengast what do you think?

cgfarmer4 commented 6 months ago

Ah, I just found this code ;) // TODO: implement selecting input device

Decided against using AVCaptureSession and instead just changed the device using CoreAudio. It seems to work for the MacBook microphone and Continuity microphone, but I haven't figured out why it doesn't work for my audio interface yet. Thoughts on this approach if I can figure out why it fails for my external audio interface?

  1. New assignMicrophoneInput function:
func assignMicrophoneInput(inputNode: AVAudioInputNode, inputDeviceID: AudioDeviceID) {
    guard let audioUnit = inputNode.audioUnit else {
        Logging.error("Failed to access the audio unit of the input node.")
        return
    }

    // Mutable copy so a pointer can be passed to AudioUnitSetProperty.
    var inputDeviceID = inputDeviceID

    // Point the input node's underlying audio unit at the chosen CoreAudio device.
    let error = AudioUnitSetProperty(
        audioUnit,
        kAudioOutputUnitProperty_CurrentDevice,
        kAudioUnitScope_Global,
        0,
        &inputDeviceID,
        UInt32(MemoryLayout<AudioDeviceID>.size)
    )

    if error != noErr {
        Logging.error("Error setting Audio Unit property: \(error)")
    } else {
        Logging.info("Successfully set input device.")
    }
}
  2. Update setupEngine:
func setupEngine(inputDeviceID: AudioDeviceID? = nil) throws -> AVAudioEngine {
    let audioEngine = AVAudioEngine()
    let inputNode = audioEngine.inputNode
    let inputFormat = inputNode.outputFormat(forBus: 0)

    // Route the input node to the requested device before installing the tap.
    if let inputDeviceID = inputDeviceID {
        assignMicrophoneInput(inputNode: inputNode, inputDeviceID: inputDeviceID)
    }
    // ... the rest of the existing setup (tap installation, etc.) is unchanged ...
}
  3. Update the start recording function to allow passing in the AudioDeviceID:
func startRecordingLive(inputDeviceID: AudioDeviceID? = nil, callback: (([Float]) -> Void)? = nil) throws {
    audioSamples = []
    audioEnergy = []

    audioEngine = try setupEngine(inputDeviceID: inputDeviceID)

    // Set the callback
    audioBufferCallback = callback
}
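Usage would look roughly like this (a sketch; externalInterfaceID is a placeholder value, and I'm assuming these functions live on AudioProcessor):

// Placeholder ID; in practice, resolve it from CoreAudio or an AVCaptureDevice.
let externalInterfaceID: AudioDeviceID = 55

let audioProcessor = AudioProcessor()
try audioProcessor.startRecordingLive(inputDeviceID: externalInterfaceID) { samples in
    // Receives live [Float] chunks from the selected microphone.
    print("Got \(samples.count) samples")
}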

Going to see if I can try some tactics from this thread for my interface, but it seems hacky.

ZachNagengast commented 6 months ago

@cgfarmer4 thanks for the effort looking into this. This looks promising, although I would also support an additional method that uses AVCaptureSession to generate audioSamples, in case some folks already have easy access to their app's AVCaptureDevice. There is nothing specifically tied to AVAudioEngine in the protocol; we'd just need to make sure it handles the various platforms that don't have access to those APIs (watchOS, for example, doesn't support AVCaptureSession). Curious to see how your tests go, and I'd be happy to integrate these back into the AudioProcessor depending on the results.
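For example, an AVCaptureSession-backed path would need to be compiled out where the API doesn't exist, roughly along these lines (just a sketch, not existing WhisperKit code):

#if !os(watchOS)
import AVFoundation

// AVCaptureSession-backed capture source; compiled out on watchOS,
// which has no AVCaptureSession.
final class CaptureSessionAudioSource {
    let session = AVCaptureSession()
    // ... configure an AVCaptureAudioDataOutput and forward samples ...
}
#endif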

cgfarmer4 commented 6 months ago

Decided against using AVCaptureSession, since there's quite a bit of buffer conversion involved that likely adds latency (loosely held hypothesis). This meets my needs because on macOS I can take the AVCaptureDevice selected by the AVCaptureSession and get the AudioDeviceID from it. AVCaptureSession would give us a list of devices on the other OSes, but what it won't do is allow AVAudioEngine to have its audioUnit changed. To get a more comprehensive device list on the other OSes, we'd need a buffer conversion mechanism from CMSampleBuffer to AVAudioPCMBuffer that is fast enough.

https://github.com/argmaxinc/WhisperKit/pull/51

static func getAudioDeviceID(for captureDevice: AVCaptureDevice) -> AudioDeviceID? {
    var propertySize: UInt32 = 0
    var address = AudioObjectPropertyAddress(
        mSelector: kAudioHardwarePropertyDevices,
        mScope: kAudioObjectPropertyScopeGlobal,
        mElement: kAudioObjectPropertyElementMain
    )

    // Query the size of the device list, then fetch every device ID.
    AudioObjectGetPropertyDataSize(AudioObjectID(kAudioObjectSystemObject), &address, 0, nil, &propertySize)

    let deviceCount = Int(propertySize) / MemoryLayout<AudioDeviceID>.size
    var deviceIDs = [AudioDeviceID](repeating: 0, count: deviceCount)
    let status = AudioObjectGetPropertyData(AudioObjectID(kAudioObjectSystemObject), &address, 0, nil, &propertySize, &deviceIDs)

    if status == noErr {
        for id in deviceIDs {
            var uidAddress = AudioObjectPropertyAddress(
                mSelector: kAudioDevicePropertyDeviceUID,
                mScope: kAudioObjectPropertyScopeGlobal,
                mElement: kAudioObjectPropertyElementMain
            )

            var deviceUID: Unmanaged<CFString>?
            var uidPropertySize = UInt32(MemoryLayout<Unmanaged<CFString>?>.size)

            let uidStatus = AudioObjectGetPropertyData(id, &uidAddress, 0, nil, &uidPropertySize, &deviceUID)

            // The getter returns a retained CFString, so take it at +1 to avoid leaking.
            if uidStatus == noErr, let deviceUID = deviceUID?.takeRetainedValue() as String? {
                // Match the CoreAudio device UID against the AVCaptureDevice's uniqueID.
                if captureDevice.uniqueID == deviceUID {
                    return id
                }
            } else {
                logger.error("Failed to get device UID with error: \(uidStatus)")
            }
        }
    } else {
        logger.error("Failed to get device IDs with error: \(status)")
    }

    return nil
}
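And usage on macOS would be something like this (a sketch that assumes the startRecordingLive change from my earlier comment):

if let mic = AVCaptureDevice.default(for: .audio),
   let deviceID = getAudioDeviceID(for: mic) {
    // Route WhisperKit's capture to the device backing the AVCaptureDevice.
    try? audioProcessor.startRecordingLive(inputDeviceID: deviceID) { samples in
        // Live [Float] samples from the selected device.
    }
}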
cgfarmer4 commented 6 months ago

Previously, when I was working with SwiftWhisper, I could translate the format from CMSampleBuffers using the float conversion method here. That implementation was not great, but good enough for demos. I'm curious if there's some conversion here that might do something similar?

https://gist.github.com/cgfarmer4/182d9d6d1cdf9d219ba0a4db6a23d745#file-capturedelegate-swift-L1-L46
https://gist.github.com/cgfarmer4/182d9d6d1cdf9d219ba0a4db6a23d745#file-audiosessionmanager-swift-L88-L111
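The core of that kind of conversion looks roughly like this (my sketch, not the gist verbatim; it assumes the buffer carries interleaved Float32 linear PCM, so Int16 or non-interleaved input would need an extra conversion pass):

import AVFoundation
import CoreMedia

// Sketch: pull the raw bytes out of a CMSampleBuffer and reinterpret
// them as Float32 samples.
func floatSamples(from sampleBuffer: CMSampleBuffer) -> [Float] {
    guard let blockBuffer = CMSampleBufferGetDataBuffer(sampleBuffer) else { return [] }

    var totalLength = 0
    var dataPointer: UnsafeMutablePointer<CChar>?
    let status = CMBlockBufferGetDataPointer(
        blockBuffer,
        atOffset: 0,
        lengthAtOffsetOut: nil,
        totalLengthOut: &totalLength,
        dataPointerOut: &dataPointer
    )
    guard status == kCMBlockBufferNoErr, let pointer = dataPointer else { return [] }

    // Reinterpret the raw bytes as Float32 and copy them out.
    let sampleCount = totalLength / MemoryLayout<Float>.size
    return pointer.withMemoryRebound(to: Float.self, capacity: sampleCount) {
        Array(UnsafeBufferPointer(start: $0, count: sampleCount))
    }
}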

bradleyandrew commented 2 months ago

I would be interested in having this integrate with AVCaptureSession too. Given the ability to use UVC capture devices in iPadOS 17, which are accessed via AVCaptureSession, it'd be handy to pass a CMSampleBuffer directly into WhisperKit. That would allow audio from any HDMI video source: cameras, game consoles, etc.

I'll need to do some testing; I already have a pipeline set up for video and audio processing from CMSampleBuffer. I will explore using the code snippets linked by @cgfarmer4 to convert the CMSampleBuffer to a [Float] and then pass that into WhisperKit to see if I can get it working.

Currently the CMSampleBuffer stream operates at its own cadence in terms of sample rate, as it's essentially just used to display audio meters based on averagePowerLevel. Looks like I'll need to force the sample rate to 16000 Hz and pass in only channel 1 audio should the source be stereo (sketch below).
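Something along these lines is what I have in mind for the resample step (a sketch, assuming the CMSampleBuffer has already been converted to an AVAudioPCMBuffer):

import AVFoundation

// Sketch: convert an arbitrary-rate PCM buffer to 16 kHz mono Float32,
// the format WhisperKit expects.
func resampleTo16kMono(_ buffer: AVAudioPCMBuffer) -> AVAudioPCMBuffer? {
    guard let targetFormat = AVAudioFormat(
            commonFormat: .pcmFormatFloat32,
            sampleRate: 16000,
            channels: 1,
            interleaved: false
          ),
          let converter = AVAudioConverter(from: buffer.format, to: targetFormat) else {
        return nil
    }

    // Size the output for the sample-rate ratio.
    let ratio = targetFormat.sampleRate / buffer.format.sampleRate
    let capacity = AVAudioFrameCount(Double(buffer.frameLength) * ratio)
    guard let output = AVAudioPCMBuffer(pcmFormat: targetFormat, frameCapacity: capacity) else {
        return nil
    }

    // Feed the single input buffer once, then report that no more data is coming.
    var consumed = false
    var error: NSError?
    let status = converter.convert(to: output, error: &error) { _, inputStatus in
        if consumed {
            inputStatus.pointee = .noDataNow
            return nil
        }
        consumed = true
        inputStatus.pointee = .haveData
        return buffer
    }

    return (status != .error && error == nil) ? output : nil
}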

Thanks!