IhorShevchuk / piper-ios-app

GNU General Public License v2.0
5 stars · 2 forks

High quality models not running on iPhone 13. Medium Quality models can only speak short phrases. #1

Open S-Ali-Zaidi opened 1 month ago

S-Ali-Zaidi commented 1 month ago

Hello -- I’ve managed to build and run the application on my iPhone 13 Pro Max.

I’m finding that any high-quality model I attempt to build and run on the iPhone 13 fails to run -- and iOS falls back to using the compact Siri voice.

Medium-quality models are able to run -- but they only speak in short phrases. If given a long sentence or phrase, the system also falls back to using the compact Siri voice.

I am able to get the high-quality models running fine in the iOS and iPadOS simulators, as well as in the Mac build.

I’m a bit confounded by this behavior on the iPhone 13. The size difference between the models is about 60 MB vs 110 MB. My iPhone can run LLMs in the 1B to 7B parameter range (at varying speeds), so I’m a bit surprised that it seems to struggle with high-quality models that are only around 20M parameters.

For context -- I’m able to run Piper Medium models just fine using WASM interfaces, such as this one.

Is this an issue with the ONNX runtime, something to do with the handling of audio buffers, or some other issue? I’m out of my depth troubleshooting this myself -- help would be appreciated!

Videos demonstrating the issue here -- using the en_GB-Cori medium model. I see the same issue with en_GB-Jenny medium. High quality models fail to run at all:

https://github.com/IhorShevchuk/piper-ios-app/assets/122964093/8db1a245-fd07-4905-b6d4-824b896994a2

https://github.com/IhorShevchuk/piper-ios-app/assets/122964093/d1d35c23-d24f-411f-801e-0e8fa89e62c5

S-Ali-Zaidi commented 1 month ago

Another example demonstrating this when using the en_GB-Jenny medium quality voice on the Speech Central app. You can see it’s performing quite well -- until it hits a longer sentence, causing the system to fall back to compact Siri.

https://github.com/IhorShevchuk/piper-ios-app/assets/122964093/941681c2-4551-4427-9bdb-1c37637e555c

IhorShevchuk commented 1 month ago

Hello @S-Ali-Zaidi, the problem with running high-quality models is their size. Apple's text-to-speech engine is built on top of AVSpeechSynthesisProviderAudioUnit, which is an app-extension-specific class, and unfortunately an app extension built on top of an Audio Unit on iOS can't use more than 60 MB of RAM. ONNX Runtime loads the whole file into RAM during initialisation, which is why the model can't be more than ~50 MB: some space has to be left for the audio buffers and the extension itself.

This is one of the reasons why this application is no more than a proof of concept, in addition to the missing SSML support and the missing callbacks for finished sentences and words.

S-Ali-Zaidi commented 1 month ago

> Hello @S-Ali-Zaidi, the problem with running high-quality models is their size. Apple's text-to-speech engine is built on top of AVSpeechSynthesisProviderAudioUnit, which is an app-extension-specific class, and unfortunately an app extension built on top of an Audio Unit on iOS can't use more than 60 MB of RAM. ONNX Runtime loads the whole file into RAM during initialisation, which is why the model can't be more than ~50 MB: some space has to be left for the audio buffers and the extension itself.
>
> This is one of the reasons why this application is no more than a proof of concept, in addition to the missing SSML support and the missing callbacks for finished sentences and words.

Ah, I had no idea about the ~60 MB limitation, which was certainly frustrating to find out.

However, I did a little poking around in the model parameters (with the help of ChatGPT, as this is not my domain) and noted that essentially all Piper model weights are stored in FP32:

import onnx
import numpy as np

# Load the ONNX model
model_path = "/Users/s.alizaidi/Programming/TTS_ALL/Piper_TTS/modules/piper/models/en_GB/en_GB-jenny.onnx"
model = onnx.load(model_path)

# Function to convert ONNX TensorProto data types to human-readable format
def tensor_dtype(tensor):
    if tensor.data_type == onnx.TensorProto.FLOAT:
        return 'float32'
    elif tensor.data_type == onnx.TensorProto.FLOAT16:
        return 'float16'
    elif tensor.data_type == onnx.TensorProto.INT8:
        return 'int8'
    elif tensor.data_type == onnx.TensorProto.INT32:
        return 'int32'
    elif tensor.data_type == onnx.TensorProto.INT64:
        return 'int64'
    # Add other data types as needed
    else:
        return 'unknown'

# Iterate through the initializers and print their data types
for initializer in model.graph.initializer:
    weight_name = initializer.name
    weight_dtype = tensor_dtype(initializer)
    print(f"Weight Name: {weight_name}, Data Type: {weight_dtype}")
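
To estimate the fp16 savings up front, the same script can also tally the total parameter count and weight bytes -- a rough sketch that reuses the model object loaded above and only counts initializer tensors, so real files will be a bit larger:

from onnx import numpy_helper

total_params = 0
total_bytes = 0
for initializer in model.graph.initializer:
    weights = numpy_helper.to_array(initializer)  # initializer tensor as a numpy array
    total_params += weights.size
    total_bytes += weights.nbytes

print(f"{total_params / 1e6:.1f}M parameters, ~{total_bytes / 1e6:.1f} MB of weights")
print(f"Estimated size at 2 bytes per weight (fp16): ~{total_params * 2 / 1e6:.1f} MB")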

I found this page, which gives instructions on reducing ONNX weights to fp16.

After a few failed attempts, I managed to convert the Jenny Medium model to fp16 with this script:

import onnx
from onnxconverter_common import float16

# Load the model
model_path = "/Users/s.alizaidi/Programming/TTS_ALL/Piper_TTS/modules/piper/models/en_GB/jenny.onnx"
model = onnx.load(model_path)

# Convert the model to float16, keeping the inputs and outputs as float32
model_fp16 = float16.convert_float_to_float16(
    model,
    keep_io_types=True,
    op_block_list=['RandomNormalLike', 'Range']
)

# Save the converted model
onnx.save(model_fp16, "/Users/s.alizaidi/Programming/TTS_ALL/Piper_TTS/modules/piper/models/en_GB/jenny_fp16.onnx")
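
A quick sanity check on the converted file before trying it on device -- just a sketch, pointing at the same output path as above -- is to run the ONNX checker and confirm that onnxruntime can still build a session and report the expected inputs:

import onnx
import onnxruntime as ort

fp16_path = "/Users/s.alizaidi/Programming/TTS_ALL/Piper_TTS/modules/piper/models/en_GB/jenny_fp16.onnx"

# Structural validation of the converted graph
onnx.checker.check_model(onnx.load(fp16_path))

# If a CPU session builds and the inputs look right, the converted model at least loads cleanly
session = ort.InferenceSession(fp16_path, providers=["CPUExecutionProvider"])
for model_input in session.get_inputs():
    print(model_input.name, model_input.type, model_input.shape)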

Happy to report that the en_GB-Jenny medium model is now 32 MB, down from 63 MB -- and it still runs inference in Piper perfectly fine!

I tried the same on the uk_UA-lada model, and after converting it to fp16, it shrank from 20 MB to 10 MB!

Attaching a drive link to the fp16 files and audio samples from the fp32 vs fp16 models.

https://drive.google.com/drive/folders/1WlB4GBs1mohKi_8y9AxMztFZKokXUMl1?usp=share_link

I’ll report back once I test out the fp16 models within a build of the iOS app. Hopefully this makes the Piper iOS app more feasible to develop!

S-Ali-Zaidi commented 1 month ago

Hello! A brief update:

With FP16 weights, the uk_UA-lada model, which is now about 10 MB, seems to run pretty well on my iPhone 13 -- both within the app and when used as a system voice in various applications. It still has the issue where longer sentences cause the system to fall back to a compact default voice.

The same goes for the Jenny FP16 model, which takes up about 20-30 MB -- but its "tolerance" before iOS falls back to a default voice is shorter, in terms of sentence length, than the 10 MB Lada model's.

You can see this demonstrated in the video, where it lags on a sentence in the middle of the excerpt and then stops completely at the final sentence -- that is usually when the debugging terminal in Xcode displays an error.

https://github.com/IhorShevchuk/piper-ios-app/assets/122964093/84ad1033-46fc-472d-b97b-6017b150381a

In terms of memory usage, when I've generated texts of various lengths and complexity in the app itself using Jenny -- on my iPhone, my Mac, and various simulated iOS devices -- I find that memory usage never really goes above 50 MB and typically hovers around 45 MB. It tends to be VERY responsive and quick in the Swift app on my Mac and in the various iOS simulators.

What I'm wondering is whether the fallback seen on real iOS devices like the iPhone 13 might have less to do with the 80 MB memory limit on Audio Unit extension apps, and more to do with the request timing out against some internal threshold iOS has for TTS generation?

If so, I wonder how a VITS-based app like yours might fare if it used a more efficient, natural-sounding, and compact VITS variant -- such as a Mini MB-iSTFT-VITS, which has a much faster real-time factor (0.02x) for speech generation than the vanilla VITS used by Piper (0.27x or 0.099x, depending on model size)?

[Attached image: IMG_9756]

If I calculated correctly, a 7M-parameter Mini MB model should take up about 15 MB or less when stored with fp16 weights?
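
For what it's worth, the napkin math behind that estimate is just two bytes per fp16 weight plus some graph overhead:

params = 7_000_000
print(f"~{params * 2 / 1e6:.0f} MB of raw fp16 weights")  # ~14 MB before graph/metadata overhead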

I'm also wondering whether CoreML conversion of any sort of VITS model might speed up performance and/or reduce memory usage enough to make them usable within the AVSpeechSynthesisProvider extension framework.

Would be curious to hear your thoughts! Thank you!

castdrian commented 4 weeks ago

@S-Ali-Zaidi it'd be cool if you'd publish your stuff on your fork. I've been trying to get this repo to work with a few Piper voices (i.e. Amy medium), because none of the built-in iOS voices have the quality I require for my PokéDex app, but they end up extremely weirdly pitched, so it'd be cool to have a second reference point for using different voices.

S-Ali-Zaidi commented 3 weeks ago

> @S-Ali-Zaidi it'd be cool if you'd publish your stuff on your fork. I've been trying to get this repo to work with a few Piper voices (i.e. Amy medium), because none of the built-in iOS voices have the quality I require for my PokéDex app, but they end up extremely weirdly pitched, so it'd be cool to have a second reference point for using different voices.

I’ll be returning to this project soon, and I’m happy to push updates to my fork when I do! Most of what I’ve been doing has been on the model quantization side, rather than changes to the actual scripts themselves.

THAT SAID -- if you are getting weirdly pitched output from your voice, I am 90% certain that is because of a sample rate mismatch between your model and the rate the audio unit is rendering at.

Note that the sample rate for Amy medium is 22,050 Hz, according to the model card (and likely the config JSON):

# Model card for amy (medium)

* Language: en_US (English, United States)
* Speakers: 1
* Quality: medium
* Samplerate: 22,050Hz

## Dataset

* URL: https://github.com/MycroftAI/mimic3-voices
* License: See URL

## Training

Finetuned from U.S. English lessac voice (medium quality).

You need to make sure the sample rate set in PiperAudioUnit.swift matches the sample rate of the Piper model you are using.

Check line 28:

self.format = AVAudioFormat(commonFormat: .pcmFormatFloat32, sampleRate: 16000.0, channels: 1, interleaved: true)!

Note that by default it is set to sampleRate: 16000.0, because that was the output sample rate of the Lada Piper model used by @IhorShevchuk. If you have not already, change it so it reads:

self.format = AVAudioFormat(commonFormat: .pcmFormatFloat32, sampleRate: 22050.0, channels: 1, interleaved: true)!

I’m pretty sure this is your issue -- a sample rate mismatch is the only reason I can think of for oddly pitched audio output.

Make sure to also replace any references to uk_UA-lada in PiperAudioUnit.swift with the filename and config filename of your Amy model -- and change the references to primaryLanguages: ["uk-UA"], supportedLanguages: ["uk-UA"] to your desired language (which for English would be en-US or en-GB).
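
If you'd rather not trust the model card, the rate is also recorded in the voice's .onnx.json config. A quick sketch for checking it (the filename is a placeholder for wherever your Amy config lives, and I'm assuming the usual Piper layout with an audio.sample_rate field):

import json

# Placeholder path: point this at your Amy voice's .onnx.json config
with open("en_US-amy-medium.onnx.json") as config_file:
    config = json.load(config_file)

print(config["audio"]["sample_rate"])  # should print 22050 for Amy medium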

For reference, here is how it looks on my end after making those modifications for a 22,050 Hz Lessac Piper model:

//
//  piperttsAudioUnit.swift
//  pipertts
//
//  Created by Ihor Shevchuk on 27.12.2023.
//

// NOTE:- An Audio Unit Speech Extension (ausp) is rendered offline, so it is safe to use
// Swift in this case. It is not recommended to use Swift in other AU types.

import AVFoundation

import piper_objc
import PiperappUtils

public class PiperttsAudioUnit: AVSpeechSynthesisProviderAudioUnit {
    private var outputBus: AUAudioUnitBus
    private var _outputBusses: AUAudioUnitBusArray!

    private var request: AVSpeechSynthesisProviderRequest?

    private var format: AVAudioFormat

    var piper: Piper?

    @objc override init(componentDescription: AudioComponentDescription, options: AudioComponentInstantiationOptions) throws {

        self.format = AVAudioFormat(commonFormat: .pcmFormatFloat32, sampleRate: 22050.0, channels: 1, interleaved: true)!

        outputBus = try AUAudioUnitBus(format: self.format)
        try super.init(componentDescription: componentDescription, options: options)
        _outputBusses = AUAudioUnitBusArray(audioUnit: self, busType: AUAudioUnitBusType.output, busses: [outputBus])
    }

    public override var outputBusses: AUAudioUnitBusArray {
        return _outputBusses
    }

    public override func allocateRenderResources() throws {
        try super.allocateRenderResources()
        Log.debug("allocateRenderResources")
        if piper == nil {
            let model = Bundle.main.path(forResource: "lessac_med_fp16", ofType: "onnx")!
            let config = Bundle.main.path(forResource: "lessac_med_fp16.onnx", ofType: "json")!
            piper = Piper(modelPath: model, andConfigPath: config)
        }
    }

    public override func deallocateRenderResources() {
        super.deallocateRenderResources()
        piper = nil
    }

    // MARK: - Rendering
    /*
     NOTE:- It is only safe to use Swift for audio rendering in this case, as Audio Unit Speech Extensions process offline.
     (Swift is not usually recommended for processing on the realtime audio thread)
     */
    public override var internalRenderBlock: AUInternalRenderBlock {
        return { [weak self] actionFlags, _, frameCount, _, outputAudioBufferList, _, _ in

            guard let self = self,
            let piper = self.piper else {
                actionFlags.pointee = .unitRenderAction_PostRenderError
                Log.error("Utterance Client is nil while request for rendering came.")
                return kAudioComponentErr_InstanceInvalidated
            }

            if piper.completed() && !piper.hasSamplesLeft() {
                Log.debug("Completed rendering")
                actionFlags.pointee = .offlineUnitRenderAction_Complete
                self.cleanUp()
                return noErr
            }

            if !piper.readyToRead() {
                actionFlags.pointee = .offlineUnitRenderAction_Preflight
                Log.debug("No bytes yet.")
                return noErr
            }

            let levelsData = piper.popSamples(withMaxLength: UInt(frameCount))

            guard let levelsData else {
                actionFlags.pointee = .offlineUnitRenderAction_Preflight
                Log.debug("Rendering in progress. No bytes.")
                return noErr
            }

            outputAudioBufferList.pointee.mNumberBuffers = 1
            var unsafeBuffer = UnsafeMutableAudioBufferListPointer(outputAudioBufferList)[0]
            let frames = unsafeBuffer.mData!.assumingMemoryBound(to: Float.self)
            unsafeBuffer.mDataByteSize = UInt32(levelsData.count)
            unsafeBuffer.mNumberChannels = 1

            for frame in 0..<levelsData.count {
                frames[Int(frame)] = levelsData[Int(frame)].int16Value.toFloat()
            }

            actionFlags.pointee = .offlineUnitRenderAction_Render

            Log.debug("Rendering \(levelsData.count) bytes")

            return noErr

        }
    }

    public override func synthesizeSpeechRequest(_ speechRequest: AVSpeechSynthesisProviderRequest) {
        Log.debug("synthesizeSpeechRequest \(speechRequest.ssmlRepresentation)")
        self.request = speechRequest
        let text = AVSpeechUtterance(ssmlRepresentation: speechRequest.ssmlRepresentation)?.speechString

        piper?.cancel()
        piper?.synthesize(text ?? "")
    }

    public override func cancelSpeechRequest() {
        Log.debug("\(#file) cancelSpeechRequest")
        cleanUp()
        piper?.cancel()
    }

    func cleanUp() {
        request = nil
    }

    public override var speechVoices: [AVSpeechSynthesisProviderVoice] {
        get {
            return [
                AVSpeechSynthesisProviderVoice(name: "Lessac", identifier: "pipertts", primaryLanguages: ["en_US"], supportedLanguages: ["en_US", "en_GB"])
            ]
        }
        set { }
    }

    public override var canProcessInPlace: Bool {
        return true
    }

}

Note that you are still likely to run into issues on iOS -- especially on older iPhones like my iPhone 13. Due to the way Apple integrates third-party TTS systems into iOS, there are strict limits on how much RAM Piper TTS can use -- about 60-80 MB.

Unless you are using a heavily quantized model, you are likely to find that your iPhone either fails to render any speech from Piper or cuts off on longer sentences.

I’m currently exploring int8 quantization-aware training of Piper models, as well as retraining some entirely with a tokenizer vocabulary instead of phonemes or graphemes. If that works out, it may allow Piper TTS models to run smoothly on iOS. Until then -- don’t expect it to work consistently, even with the above fixes!
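
In the meantime, if anyone wants a cheaper experiment than full quantization-aware training, onnxruntime ships post-training dynamic quantization. To be clear, this is a different (and lossier) technique than the QAT I mentioned, and the paths below are placeholders -- but it's a quick way to see how far the file size drops and how much the audio degrades:

from onnxruntime.quantization import quantize_dynamic, QuantType

# Weights are stored as int8 and dequantized at runtime; activations stay in float.
quantize_dynamic(
    model_input="jenny.onnx",        # placeholder: the original fp32 model
    model_output="jenny_int8.onnx",  # placeholder output path
    weight_type=QuantType.QInt8,
)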