CommunityToolkit / WindowsCommunityToolkit

The Windows Community Toolkit is a collection of helpers, extensions, and custom controls. It simplifies and demonstrates common developer tasks for building .NET apps with UWP and the Windows App SDK / WinUI 3 for Windows 10 and Windows 11. The toolkit is part of the .NET Foundation.
https://docs.microsoft.com/windows/communitytoolkit/

VoiceCommand support using SpeechRecognizer #3392

Closed sonnemaf closed 2 years ago

sonnemaf commented 4 years ago

It would be nice if you could assign VoiceCommands to buttons using the UWP SpeechRecognizer. Maybe this shouldn't be limited to buttons only.

<Button Click="ButtonSave_Click"
        Content="Save">
    <Button.VoiceCommands>
        <VoiceCommand Text="Save" />
        <VoiceCommand Text="Store it" />
    </Button.VoiceCommands>
</Button>

Describe the solution

There are a lot of ways to implement this. You can create Attached Properties or use Behaviors. Not sure what the correct path is. I have created this issue to start the discussion.

Describe alternatives you've considered

As a test I have created this VoiceCommandTrigger (Behavior). It works fine. Not sure if this is the right path. It uses the Microsoft.Xaml.Behaviors.Uwp.Managed NuGet package.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.Xaml.Interactivity;
using Windows.Media.SpeechRecognition;
using Windows.UI.Core;
using Windows.UI.Xaml;
using Windows.UI.Xaml.Automation.Peers;
using Windows.UI.Xaml.Automation.Provider;
using Windows.UI.Xaml.Controls;

public class VoiceCommandTrigger : Trigger {

    public string Text {
        get => (string)GetValue(TextProperty);
        set => SetValue(TextProperty, value);
    }

    public static readonly DependencyProperty TextProperty = DependencyProperty.Register(nameof(Text), typeof(string), typeof(VoiceCommandTrigger), new PropertyMetadata(default(string), OnTextPropertyChanged));

    private static void OnTextPropertyChanged(DependencyObject d, DependencyPropertyChangedEventArgs e) {
        // Keep the static phrase-to-trigger map in sync when Text changes.
        if (d is VoiceCommandTrigger source) {
            var newValue = (string)e.NewValue;
            var oldValue = (string)e.OldValue;
            if (!string.IsNullOrEmpty(oldValue)) {
                _triggers.Remove(oldValue);
            }
            if (!string.IsNullOrEmpty(newValue)) {
                _triggers[newValue] = source;
            }
        }
    }

    private static SpeechRecognizer _sr;
    private static readonly Dictionary<string, VoiceCommandTrigger> _triggers = new Dictionary<string, VoiceCommandTrigger>(StringComparer.InvariantCultureIgnoreCase);

    static VoiceCommandTrigger() {
        // Fire-and-forget: create one shared recognizer and start a
        // continuous session the first time any trigger is used.
        Task.Run(async () => {
            _sr = new SpeechRecognizer();
            _sr.ContinuousRecognitionSession.AutoStopSilenceTimeout = TimeSpan.MaxValue;
            await _sr.CompileConstraintsAsync();
            _sr.ContinuousRecognitionSession.ResultGenerated += ContinuousRecognitionSession_ResultGenerated;
            await _sr.ContinuousRecognitionSession.StartAsync();
        });
    }

    private static void ContinuousRecognitionSession_ResultGenerated(SpeechContinuousRecognitionSession sender, SpeechContinuousRecognitionResultGeneratedEventArgs args) {
        Debug.WriteLine(args.Result.Text);
        // Results arrive on a background thread, so dispatch to the UI
        // thread before executing the trigger's Actions.
        if (_triggers.TryGetValue(args.Result.Text, out var trigger)) {
            _ = trigger.Dispatcher.RunAsync(CoreDispatcherPriority.Normal, () => {
                Interaction.ExecuteActions(trigger.AssociatedObject, trigger.Actions, args);
            });
        }
    }

    protected override void OnAttached() {
        base.OnAttached();
        _triggers[this.Text] = this;
    }

    protected override void OnDetaching() {
        base.OnDetaching();
        // Only remove the mapping if it still points at this trigger
        // (TryGetValue avoids a KeyNotFoundException if Text was changed).
        if (_triggers.TryGetValue(this.Text, out var current) && current == this) {
            _triggers.Remove(this.Text);
        }
    }
}

public class ClickAction : DependencyObject, IAction {

    public object Execute(object sender, object parameter) {
        // Raise a genuine Click through UI Automation so the button's
        // Click handlers run exactly as if the user had clicked it.
        if (sender is Button btn && btn.IsEnabled) {
            var peer = new ButtonAutomationPeer(btn);
            var invokeProv = peer.GetPattern(PatternInterface.Invoke) as IInvokeProvider;
            invokeProv?.Invoke();
        }

        return null;
    }
}

In the following XAML I have used the VoiceCommandTrigger.

<Button Content="Speak" Height="153" Margin="138,460,0,0" VerticalAlignment="Top" Width="420"
        Click="Button_Click">
    <Custom:Interaction.Behaviors>
        <local:VoiceCommandTrigger Text="Increase">
            <Custom1:ChangePropertyAction PropertyName="Width" Value="500" />
        </local:VoiceCommandTrigger>
        <local:VoiceCommandTrigger Text="Decrease">
            <local:ClickAction />
        </local:VoiceCommandTrigger>
    </Custom:Interaction.Behaviors>
</Button>

The Button_Click method used:

private void Button_Click(object sender, RoutedEventArgs e) {
    (sender as Button).Width -= 100;
}
ghost commented 4 years ago

Hello, @sonnemaf! Thanks for submitting a new feature request. I've automatically added a vote 👍 reaction to help get things started. Other community members can vote to help us prioritize this feature in the future!

Kyaa-dost commented 4 years ago

@sonnemaf Thanks for highlighting the feature and sharing the work. Let's see what our devs have to say on this one.

ptorr-msft commented 4 years ago

@sonnemaf , great idea. Do you (or anyone else reading the thread) currently make your apps accessible to screen readers etc. via UI Automation? If so, how do you think voice control would interact (or not) with those features?

michael-hawker commented 4 years ago

@sonnemaf just to clarify, this is using the system built-in SpeechRecognizer API? Do you know how this works with localization? Does the developer need to localize all the commands per language they want to support or does it kind of work off of English and transcribe at the system/API layer?

I do think I agree this is probably beyond the scope of contributing to the Behaviors package directly; even though they have a UWP package, it really just swaps base types compared to the WPF one, and they only want generalized behaviors. So the toolkit makes sense for a UWP-specific helper like this. Whether it's actually implemented as a Behavior or an Attached Property, I'm not sure. I think an Attached Property could be easier for a developer to use, but depending on initialization/timing you may need a behavior to optimize loading? What did you find in your initial trials with this; is that why you implemented it as a Behavior?

However, I think @ptorr-msft posed a great question. Overall, voice commands would be a separate feature outside the standard UI Automation properties, but it could be interesting to have a general helper that uses those existing properties to automatically hook up voice navigation. A developer would just hook this to their app/page as a service and it would do the rest. Maybe that's a larger-scoped feature idea to do in addition to this?

sonnemaf commented 4 years ago

@michael-hawker It works with localization, but only for a few languages. Speech recognition is available only for the following languages: English (United States, United Kingdom, Canada, India, and Australia), French, German, Japanese, Mandarin (Chinese Simplified and Chinese Traditional), and Spanish. Source

I have updated my VoiceCommandTrigger demo. It now supports English and German. It uses x:Uid for the Buttons and VoiceCommandTriggers.

I have published my demo app on https://github.com/sonnemaf/VoiceCommandsDemo. It also contains an improved version of the VoiceCommandTrigger. It initializes the SpeechRecognizer with the first supported language.
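For illustration, picking the first supported language could be done roughly like this. This is a sketch, not the exact code from the demo repo; it uses the SpeechRecognizer.SupportedGrammarLanguages and SystemSpeechLanguage APIs.

using System.Linq;
using Windows.Globalization;
using Windows.Media.SpeechRecognition;

public static class RecognizerFactory {

    public static SpeechRecognizer CreateForFirstSupportedLanguage() {
        // Languages installed on this device that work with grammar/list
        // constraints; fall back to the system speech language.
        Language language = SpeechRecognizer.SupportedGrammarLanguages.FirstOrDefault()
                            ?? SpeechRecognizer.SystemSpeechLanguage;
        return new SpeechRecognizer(language);
    }
}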

I use a Behavior (Trigger) and not an Attached Property because I think this is more flexible. It has nothing to do with timing. With behaviors you can assign multiple commands, and sometimes you might want different Texts for the same command. See the example below.

<Button Click="ButtonSave_Click"
        Content="Save">
    <Button.VoiceCommands>
        <VoidCommand Text="Save" />
        <VoidCommand Text="Store it" />
    </Button.VoiceCommands>
</Button>

<!-- Attached Property -->
<Button Click="ButtonSave_Click"
        VoiceCommand.Text="Save"
        Content="Save"/>

Maybe you could also solve this problem with a separator. In the example below I used a pipe separator.

<Button Click="ButtonSave_Click"
        VoiceCommand.Text="Save|Store it"
        Content="Save"/>

With this Trigger you can assign one or more Actions to it. The Attached Property would only do a Click on a Button. This, I think, is the real advantage.
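For comparison, a rough sketch of what the pipe-separated attached-property variant could look like (all names here are illustrative; this is not toolkit code):

using System;
using System.Collections.Generic;
using Windows.UI.Xaml;
using Windows.UI.Xaml.Automation.Peers;
using Windows.UI.Xaml.Automation.Provider;
using Windows.UI.Xaml.Controls;

public static class VoiceCommand {

    // Maps each phrase ("Save", "Store it", ...) to the button it should click.
    private static readonly Dictionary<string, Button> _phrases = new Dictionary<string, Button>(StringComparer.InvariantCultureIgnoreCase);

    public static readonly DependencyProperty TextProperty = DependencyProperty.RegisterAttached("Text", typeof(string), typeof(VoiceCommand), new PropertyMetadata(null, OnTextChanged));

    public static string GetText(DependencyObject obj) => (string)obj.GetValue(TextProperty);
    public static void SetText(DependencyObject obj, string value) => obj.SetValue(TextProperty, value);

    private static void OnTextChanged(DependencyObject d, DependencyPropertyChangedEventArgs e) {
        if (d is Button button && e.NewValue is string text) {
            // "Save|Store it" registers two phrases for the same button.
            foreach (var phrase in text.Split('|')) {
                _phrases[phrase.Trim()] = button;
            }
        }
    }

    // Would be called from the shared recognizer's ResultGenerated handler
    // (after dispatching to the UI thread) with the recognized text.
    public static void TryInvoke(string recognizedText) {
        if (_phrases.TryGetValue(recognizedText, out var button) && button.IsEnabled) {
            // Same UI Automation click as the ClickAction above.
            var peer = new ButtonAutomationPeer(button);
            (peer.GetPattern(PatternInterface.Invoke) as IInvokeProvider)?.Invoke();
        }
    }
}

As the sketch shows, the attached property is limited to one hard-coded action (the click), which is exactly the flexibility argument made above.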

sonnemaf commented 4 years ago

@sonnemaf , great idea. Do you (or anyone else reading the thread) currently make your apps accessible to screen readers etc. via UI Automation? If so, how do you think voice control would interact (or not) with those features?

@ptorr-msft I'm ashamed that I haven't done this (yet). I can imagine that it should interact with those features. I'm only afraid that this would take ages to implement. An extra component in the toolkit is much faster to implement.

niels9001 commented 4 years ago

I must say that this is a really exciting feature request! Voice interaction is becoming a common way of interacting with devices. The popularity of home speakers (e.g. Echo or Google Home) shows that consumers accept and are capable of using systems this way - we see the same in the enterprise space. Having a standard way of using voice to interact with UI elements on the Windows/WinUI platform would be great.

Use cases

Accessibility is obvious: Microsoft invested a lot in making accessible tech: the gaze tracking support is a perfect example of this.

Input stack: with controller, keyboard, dial, mouse and gaze support in the XAML layer, it would be great to have easy support for voice as well. Right now this all needs to happen in code-behind.

Healthcare: there are so many use cases in healthcare (and beyond that, in enterprise contexts where users aren't able to use both hands) that require 'no touch' interaction. Sterility is of the utmost importance in an operating theatre, and there are many situations where a nurse or physician simply can't use a mouse, keyboard or touchscreen because their hands are busy, e.g. treating a patient. Having a way to still control functions in applications by using voice would be a huge win. The COVID-19 crisis shows that, due to infection concerns, we will be moving to ways of interacting with devices with as little physical touch as possible.

Features

I think @sonnemaf showed some great examples of what could be possible in terms of XAML support. Adding voice commands to interactive controls would be perfect, as well as defining voice commands on a 'page' level (e.g. "Next screen").

Another big win would be around text entry: could we make TextBoxes voice-capable on focus? E.g. a user would tap, click (or, with Gaze support, just LOOK at a TextBox) and could then use their voice to input data.
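As a rough sketch (not part of the proposal's code), focus-triggered dictation could be as simple as one free-text recognition pass in a GotFocus handler; microphone consent and error handling are omitted here:

using Windows.Media.SpeechRecognition;
using Windows.UI.Xaml;
using Windows.UI.Xaml.Controls;

private async void TextBox_GotFocus(object sender, RoutedEventArgs e) {
    var box = (TextBox)sender;
    using (var recognizer = new SpeechRecognizer()) {
        // With no constraints added, CompileConstraintsAsync enables the
        // default dictation grammar.
        await recognizer.CompileConstraintsAsync();
        SpeechRecognitionResult result = await recognizer.RecognizeAsync();
        if (result.Status == SpeechRecognitionResultStatus.Success) {
            box.Text = result.Text;
        }
    }
}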

For some other examples that might be interesting, check out this blog post


michael-hawker commented 4 years ago

Thanks @niels9001 for some great input and resources! 🦙❤

@sonnemaf you should be able to use an attached property too; that's what we do for the implicit animations in the toolkit. You just need a helper type to collect them as a list. Either way, it seems like you've got a great start!
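For reference, the collection-style attached property could be shaped roughly like this (the type names are made up; the toolkit's implicit-animations code is the real-world model for the pattern):

using System.Collections.Generic;
using Windows.UI.Xaml;

public class VoiceCommandItem : DependencyObject {
    public string Text { get; set; }
}

public class VoiceCommandCollection : List<VoiceCommandItem> { }

public static class Voice {

    public static readonly DependencyProperty CommandsProperty = DependencyProperty.RegisterAttached("Commands", typeof(VoiceCommandCollection), typeof(Voice), new PropertyMetadata(null));

    public static VoiceCommandCollection GetCommands(DependencyObject obj) {
        // Lazily create the collection so XAML can populate it item by item.
        var collection = (VoiceCommandCollection)obj.GetValue(CommandsProperty);
        if (collection == null) {
            collection = new VoiceCommandCollection();
            obj.SetValue(CommandsProperty, collection);
        }
        return collection;
    }

    public static void SetCommands(DependencyObject obj, VoiceCommandCollection value) => obj.SetValue(CommandsProperty, value);
}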

What would you propose our next steps be? Did you want to think about the API/use-cases more or start with implementing what you have as a base case in a PR?

sonnemaf commented 4 years ago

@michael-hawker I forgot the trick you used for collections on attached properties. 😊

I have fixed a problem in my repository with reactivating the app. It seems that you have to start listening again after reactivation. It is not the most beautiful solution, but it works.
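One way to restart listening on reactivation (a sketch; whether the demo hooks Resuming or a visibility event is not shown here) would be to start the shared session again when the recognizer is idle:

using Windows.ApplicationModel.Core;
using Windows.Media.SpeechRecognition;

CoreApplication.Resuming += async (s, e) => {
    // _sr is the shared static SpeechRecognizer from the trigger above.
    if (_sr.State == SpeechRecognizerState.Idle) {
        await _sr.ContinuousRecognitionSession.StartAsync();
    }
};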

Maybe we should also think about a 'What can I say?' feature, as described in these docs.

Should a voice command also contain a minimum confidence value? The Action would then only be invoked if the RawConfidence is above this minimum.
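A sketch of how that could slot into the existing handler; MinConfidence would be a new (hypothetical) property on the trigger, while RawConfidence is the existing property on SpeechRecognitionResult:

private static void ContinuousRecognitionSession_ResultGenerated(SpeechContinuousRecognitionSession sender, SpeechContinuousRecognitionResultGeneratedEventArgs args) {
    // Only fire when the phrase matches AND the engine is confident enough.
    if (_triggers.TryGetValue(args.Result.Text, out var trigger)
        && args.Result.RawConfidence >= trigger.MinConfidence) {
        _ = trigger.Dispatcher.RunAsync(CoreDispatcherPriority.Normal, () => {
            Interaction.ExecuteActions(trigger.AssociatedObject, trigger.Actions, args);
        });
    }
}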


I could create a PR already. In which assembly/project and namespace should I place the Trigger?

sonnemaf commented 4 years ago

I have updated my sample app. It is now a functional page. I even added a prototype of a 'What can I say?' solution.


The VoiceCommandTrigger currently uses the UWP SpeechRecognizer class directly. I think this is wrong: it should allow plugging in any speech recognizer solution. I will try to implement this in the next iteration.
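One possible shape for that abstraction (the names are illustrative, not from the repo): the trigger would depend on a small interface instead of the UWP SpeechRecognizer class, so other engines can be plugged in.

using System;
using System.Threading.Tasks;

public interface ISpeechRecognitionService {

    // Raised with the recognized phrase text.
    event EventHandler<string> PhraseRecognized;

    // Start/stop continuous listening.
    Task StartAsync();
    Task StopAsync();
}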

michael-hawker commented 4 years ago

Hi @sonnemaf, sorry for the delay, missed question on where to put these.

We're thinking as part of our #3062 to move the Behaviors to their own Toolkit package. There's some work to be done still there. Let me look into that, and then we'll have a clear place to put this. 🙂

sonnemaf commented 4 years ago

Hi @michael-hawker, no worries I was very busy myself.

I have updated my sample project. The SpeechRecognition engine used by the trigger is now pluggable. This makes it way better. It really needs a review. That will come when I create the PR. Hope you find a nice place for this behavior. I think it is very cool.

jamesmcroft commented 3 years ago

@sonnemaf by attaching these behaviors to controls of a page, is the speech recognizer always listening?

If so, I'd be concerned that a regular user of an application wouldn't be keen on this. I'm completely on board with the idea of using voice to interact with UI, but I feel like it needs an activation keyword or action in order to start listening.

sonnemaf commented 3 years ago

@jamesmcroft thank you for this feedback. I think you are right. There should be an easy way to turn them on or off.

I have now added an IsEnabled property on the VoiceCommandTrigger. In the SamplePage I have added a 'Voice' ToggleSwitch. If it is Off, the button and listbox commands don't work.


The IsEnabled property of the VoiceCommandTrigger objects is data-bound to the IsOn property of the ToggleSwitch.

<Button x:Uid="ButtonAdd"
        Grid.Row="1"
        Grid.Column="1"
        HorizontalAlignment="Stretch"
        Click="ButtonAdd_Click"
        Content="&gt; Add &gt;">
    <Interactivity:Interaction.Behaviors>
        <Behaviors:VoiceCommandTrigger x:Uid="CommandAdd"
                                        IsEnabled="{x:Bind toggleListning.IsOn, Mode=OneWay}"
                                        Text="Add|at">
            <local:ClickAction />
        </Behaviors:VoiceCommandTrigger>
    </Interactivity:Interaction.Behaviors>
</Button>
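For completeness, the IsEnabled member on the trigger could be a standard dependency property so the x:Bind above works (a sketch, in the same style as the Text property earlier in the thread):

public bool IsEnabled {
    get => (bool)GetValue(IsEnabledProperty);
    set => SetValue(IsEnabledProperty, value);
}

// Defaults to true, matching the sample's 'On' default.
public static readonly DependencyProperty IsEnabledProperty = DependencyProperty.Register(nameof(IsEnabled), typeof(bool), typeof(VoiceCommandTrigger), new PropertyMetadata(true));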

In my sample the default is 'On', but that is something the developer can choose.

Would this be enough?

niels9001 commented 3 years ago

Agree with @jamesmcroft, enabling an always-listening voice UI might not work in all situations. Especially when opening an app while on the phone, you don't want your app to start doing things - so maybe it should be off by default?

A wake word would be nice - but on desktop, I can imagine that there are some other triggers that would be really useful as well, e.g. leveraging the keyboard.

Example: press the space bar down -> voice recognition turns on and stays on while the space bar is held. Release the space bar -> voice recognition turns off.
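A minimal sketch of that push-to-talk idea, e.g. wired up in the page constructor (the static IsListening switch on the trigger is hypothetical):

using Windows.System;
using Windows.UI.Xaml;

var window = Window.Current.CoreWindow;
window.KeyDown += (s, e) => {
    if (e.VirtualKey == VirtualKey.Space) VoiceCommandTrigger.IsListening = true;
};
window.KeyUp += (s, e) => {
    if (e.VirtualKey == VirtualKey.Space) VoiceCommandTrigger.IsListening = false;
};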

Awesome work @sonnemaf , really excited about this :)!

jamesmcroft commented 3 years ago

@sonnemaf I think this probably gives enough customization to allow a developer to make the choice on how to activate the voice commands.

I really like this feature! Looking forward to taking it out for a spin

sonnemaf commented 3 years ago

Thanks @niels9001 for this feedback. I think having an IsEnabled property should be enough. What to do with it is up to the user (developer). If they want to turn it on/off using the space bar they can.

What the default value should be is a good question. I think true is the correct one; it will avoid a lot of support issues. What do others think?

niels9001 commented 3 years ago

@sonnemaf Yep agree!

I think the Gaze APIs that the Toolkit provides are turned off by default, I guess to avoid situations where the entire UI becomes interactable through a non-explicit way of control, or for situations where you only want to gaze- (or voice-) enable a specific (user)control instead of the entire page.

I could see that model working here as well.